一,实践:Section 3.6 Lab: Linear Regression

In [1]:
# 先加载两个基本数据集
library(MASS)
library(ISLR2)
Attaching package: ‘ISLR2’


The following object is masked from ‘package:MASS’:

    Boston


In [ ]:
# ISLR2包含Boston数据集,记录了波士顿506个人口普查区的房价中位数(medv),我们将使用12个预测指标来预测medv,如rm(每栋房子的平均房间数)、age(1940年之前建造的自住单元的比例)和lstat(社会经济地位较低的家庭的百分比)

head(Boston)
A data.frame: 6 × 13
crimzninduschasnoxrmagedisradtaxptratiolstatmedv
<dbl><dbl><dbl><int><dbl><dbl><dbl><dbl><int><dbl><dbl><dbl><dbl>
10.00632182.3100.5386.57565.24.0900129615.34.9824.0
20.02731 07.0700.4696.42178.94.9671224217.89.1421.6
30.02729 07.0700.4697.18561.14.9671224217.84.0334.7
40.03237 02.1800.4586.99845.86.0622322218.72.9433.4
50.06905 02.1800.4587.14754.26.0622322218.75.3336.2
60.02985 02.1800.4586.43058.76.0622322218.75.2128.7

我们将首先使用lm()函数来拟合一个简单的线性回归lm()模型, 其中medv作为响应变量,lstat作为预测器。基本语法是lm(y ~ x, data),其中y是响应,x是预测,data是保存这两个变量的数据集。

In [5]:
lm.fit <- lm(medv ~ lstat, data = Boston)

# 上面是正常写法,下面是使用attach函数的写法
# attach(Boston) 将Boston数据集中的变量放入到搜索路径中
# lm.fit <- lm(medv ~ lstat)

打印lm.fit的话会输出一些关于模型的基本信息。

要获得更详细的信息,我们使用summary(lm.fit),这为我们提供了系数的P值和标准误差SE,以及模型的R2统计量和F统计量

In [6]:
lm.fit
Call:
lm(formula = medv ~ lstat, data = Boston)

Coefficients:
(Intercept)        lstat  
      34.55        -0.95  
In [7]:
summary(lm.fit)
Call:
lm(formula = medv ~ lstat, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.168  -3.990  -1.318   2.034  24.500 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.55384    0.56263   61.41   <2e-16 ***
lstat       -0.95005    0.03873  -24.53   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared:  0.5441,	Adjusted R-squared:  0.5432 
F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16
In [ ]:
# 访问lm对象的各个部分
names(x = lm.fit) #拟合模型的各个部分
coef(lm.fit) #所估计的系数
confint(lm.fit) #所估计的系数的置信区间
  1. 'coefficients'
  2. 'residuals'
  3. 'effects'
  4. 'rank'
  5. 'fitted.values'
  6. 'assign'
  7. 'qr'
  8. 'df.residual'
  9. 'xlevels'
  10. 'call'
  11. 'terms'
  12. 'model'
(Intercept)
34.5538408793831
lstat
-0.950049353757991
A matrix: 2 × 2 of type dbl
2.5 %97.5 %
(Intercept)33.44845735.6592247
lstat-1.026148-0.8739505

对于给定的lstat指,可以使用predict函数生成medv预测值的置信区间和预测区间

In [12]:
predict(lm.fit, data.frame(lstat = (c(5, 10, 15))),interval = "confidence") #预测值和置信区间
predict(lm.fit, data.frame(lstat = (c(5, 10, 15))),interval = "prediction") #预测值和预测区间
A matrix: 3 × 3 of type dbl
fitlwrupr
129.8035929.0074130.59978
225.0533524.4741325.63256
320.3031019.7315920.87461
A matrix: 3 × 3 of type dbl
fitlwrupr
129.8035917.56567542.04151
225.0533512.82762637.27907
320.30310 8.07774232.52846

image.png image-2.png image-3.png

image.png

In [ ]:
# 使用plot()和abline()函数绘制medv和lstat以及最小二乘回归线
plot(Boston$lstat, Boston$medv)
# 或者attach(Boston)之后直接使用plot(lstat, medv)
abline(lm.fit)

# lstat与medv之间存在一定的非线性关系
No description has been provided for this image

image.png

image.png

abline()函数可以用来绘制任何直线,而不仅仅是最小二乘回归直线。

为了绘制截距为a、斜率为b的直线,我们输入abline(a, b)。

下面我们对绘制直线和点进行一些额外的设置。lwd = 3命令使回归线的宽度增加3倍;这也适用于plot()和lines()函数。我们也可以使用pch选项来创建不同的绘图符号

In [19]:
plot(Boston$lstat, Boston$medv)
abline(lm.fit, lwd = 3) 
No description has been provided for this image

image.png

In [18]:
plot(Boston$lstat, Boston$medv)
abline(lm.fit, lwd = 3, col = "red") #
No description has been provided for this image

image.png

In [ ]:
attach(Boston)
plot(lstat, medv, col = "red")
plot(lstat, medv, pch = 20)
plot(lstat, medv, pch = "+")
plot(1:20, 1:20, pch = 1:20) #pch参数可以设置点的形状
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image
No description has been provided for this image

接下来,我们研究一些诊断图,其中一些在第3.3.3节中讨论。

通过直接将plot()函数应用于lm()的输出,可以自动生成四个诊断图。

通常,该命令一次生成一个图形,点击Enter将生成下一个图形。然而,将所有四个图一起查看通常很方便。我们可以通过使用将显示屏幕分割成单独的面板来实现这一点,这样可以同时查看多个plot。例如,par()和par(mfrow = c(2,2)) mfrow()函数,它们将绘制区域划分为2 × 2网格的面板。

In [21]:
par(mfrow = c(2, 2))
plot(lm.fit)
No description has been provided for this image

image.png

我们可以使用residuals()函数从线性回归拟合中计算残差

函数rstudent()将返回学生化的残差,我们可以使用这个函数来绘制残差与拟合值的关系

In [22]:
predict(lm.fit)
1
29.8225950976683
2
25.8703897860351
3
30.7251419837384
4
31.7606957793346
5
29.490077823853
6
29.604083746304
7
22.7447274121713
8
16.3603957549176
9
6.11886372140644
10
18.3079969301215
11
15.1253315950322
12
21.9466859550146
13
19.6285655318451
14
26.7064332173421
15
24.8063345098261
16
26.5069228530529
17
28.3025161316555
18
20.6166168597534
19
23.4477639339522
20
23.837284168993
21
14.5838034633901
22
21.4146583169101
23
16.7689169770335
24
15.6668597266742
25
19.0680364131278
26
18.8685260488387
27
20.4836099502273
28
18.136988046445
29
22.3932091512808
30
23.1722496213624
31
13.0827254844525
32
22.1651973063789
33
8.22797328674918
34
17.120435237924
35
15.2298370239456
36
25.3573631350058
37
23.7137777530044
38
26.2219080469255
39
24.9298409258147
40
30.4496276711486
41
32.6727431589423
42
29.9556020071944
43
29.0340541340492
44
27.4854736874236
45
25.4808695509943
46
24.853836977514
47
21.1106425237075
48
16.6929130287329
49
5.28282029009941
50
19.1630413485037
51
21.7756770713381
52
25.5948754734452
53
29.5375802915409
54
26.5449248272032
55
20.4931104437648
56
29.9841034878072
57
29.0720561081995
58
30.801145932039
59
28.0365023126033
60
25.7943858377344
61
22.0606918774655
62
20.8351282111177
63
28.1600087285918
64
25.5283720186822
65
26.9059435816313
66
30.1171103973333
67
24.8253354969013
68
26.8584411139434
69
22.117694838691
70
26.2029070598504
71
28.1695092221294
72
25.1673532642542
73
29.309568446639
74
27.3904687520479
75
28.1125062609039
76
26.0603996567867
77
23.1817501149
78
24.7968340162885
79
22.8302318540095
80
25.9083917601854
81
29.5280797980033
82
27.6944845452504
83
28.1695092221294
84
27.4189702326606
85
25.4143660962312
86
28.3500185993434
87
22.3362061900553
88
26.5354243336657
89
29.3285694337142
90
29.1385595629626
91
26.1839060727752
92
26.7634361785676
93
26.8014381527179
94
28.654034392546
95
24.492818223086
96
28.2360126768925
97
23.7802812077675
98
30.554133100062
99
31.1621646864671
100
28.6730353796211
101
25.6043759669828
102
27.2669623360593
103
24.4548162489357
104
21.7851775648757
105
22.8397323475471
106
18.906528022989
107
16.825919938259
108
21.167645484933
109
22.8967353087726
110
19.7805734284463
111
22.2031992805292
112
24.9013394452019
113
19.1535408549661
114
18.317497423659
115
24.6258251326121
116
19.5810630641572
117
23.1152466601369
118
24.7683325356758
119
19.9515823121228
120
21.6236691747368
121
20.9016316658808
122
20.9966366012566
123
17.5194559665023
124
10.4130868003926
125
17.8519732403176
126
20.4836099502273
127
8.65549549594027
128
18.2224924882832
129
19.9325813250476
130
17.1299357314616
131
22.5832190220324
132
22.9062358023101
133
23.9892920655942
134
20.2745990924005
135
18.1084865658323
136
18.4410038396476
137
18.4980068008731
138
20.692620808054
139
14.2987886572627
140
17.0159298090106
141
11.60064849259
142
1.86264261657064
143
9.07351721159379
144
9.45353695309698
145
6.72689530781155
146
8.14246884491096
147
18.7355191393126
148
6.49888346290963
149
7.6484431809568
150
14.1752822412742
151
21.1581449913954
152
21.937185461477
153
23.0392427118363
154
19.5525615835444
155
20.1890946505623
156
20.2840995859381
157
19.2200443097291
158
30.1931143456339
159
28.4450235347192
160
27.5329761551116
161
29.3285694337142
162
32.9102554973818
163
32.7297461201678
164
31.3996770249066
165
23.4952664016401
166
25.2338567190172
167
31.0386582704785
168
23.0202417247611
169
24.0082930526694
170
23.7992821948426
171
20.8446287046553
172
23.1247471536745
173
20.5976158726782
174
25.9653947214109
175
25.3953651091561
176
29.490077823853
177
24.9488419128898
178
28.5780304442453
179
27.9794993513778
180
29.7655921364428
181
27.3714677649727
182
25.5758744863701
183
29.9746029942696
184
29.1575605500377
185
21.2721509138464
186
22.0606918774655
187
30.32612125516
188
28.2075111962797
189
30.2216158262467
190
29.4330748626275
191
29.7085891752173
192
30.0981094102581
193
31.8271992340977
194
29.7750926299804
195
30.3926247099231
196
31.7321942987219
197
30.6776395160505
198
26.3739159435268
199
28.2645141575052
200
30.2216158262467
201
⋯
202
28.4070215605689
203
27.3999692455854
204
30.2406168133218
205
25.0818488224159
206
22.5452170478821
207
28.8725457439103
208
23.4192624533394
209
27.048450984695
210
25.7373828765089
211
23.6282733111662
212
17.1394362249991
213
19.4100541804807
214
24.7113295744503
215
22.4597126060439
216
27.7134855323256
217
28.0270018190657
218
27.2384608554466
219
23.4002614662643
220
28.7395388343842
221
29.7275901622925
222
28.7110373537715
223
22.4027096448184
224
25.0818488224159
225
27.5804786227995
226
25.917892253723
227
22.7447274121713
228
27.114954439458
229
29.1575605500377
230
28.1410077415167
231
26.9439455557816
232
25.2433572125548
233
24.5213197036987
234
26.4689208789026
235
25.3003601737803
236
25.7278823829714
237
29.3380699272517
238
26.3359139693765
239
27.7324865194007
240
30.1741133585588
241
24.5498211843115
242
22.5167155672694
243
28.5115269894823
244
28.8630452503727
245
28.9580501857485
246
28.8725457439103
247
29.3380699272517
248
27.1529564136084
249
30.2786187874721
250
26.9059435816313
251
29.2620659789511
252
17.8329722532425
253
21.9466859550146
254
23.6472742982414
255
22.5167155672694
256
27.1529564136084
257
21.0726405495572
258
24.8728379645892
259
20.6451183403661
260
29.5280797980033
261
27.7894894806262
262
21.2531499267712
263
21.8896829937891
264
31.4566799861321
265
31.0101567898658
266
31.7416947922594
267
25.4998705380695
268
26.1174026180121
269
1.52062484921776
270
-1.51953308280781
271
21.7851775648757
272
12.4746938980474
273
14.3747926055634
274
12.0471716888563
275
13.8617659545341
276
18.2034915012081
277
14.5268005021647
278
12.1326761306945
279
11.2206287510868
280
5.45382917377584
281
5.28282029009941
282
7.68644515510712
283
4.16176205266498
284
5.46332966731342
285
14.745311853529
286
18.2984964365839
287
16.7309150028832
288
10.1565734748779
289
20.1415921828744
290
19.02053394544
291
18.2889959430463
292
16.1513848970908
293
15.6288577525239
294
5.49183114792616
295
6.08086174725612
296
9.12101967928169
297
15.2488380110207
298
15.2583385045583
299
15.7713651555876
300
8.54148957348932
301
12.7217067300245
302
12.3796889626716
303
23.0297422182987
304
9.47253794017215
305
15.76186466205
306
24.9488419128898
307
14.3937935926385
308
1.90064459072096
309
15.4768498559226
310
-0.578984222587394
311
6.95490715271347
312
10.0520680459645
313
9.24452609527023
314
14.9638232048933
315
12.9497185749264
316
20.2840995859381
317
19.6380660253826
318
21.1581449913954
319
12.4271914303595
320
18.250993968896
321
11.3821371412257
322
19.6475665189202
323
20.7591242628171
324
14.1087787865111
325
11.6766524408907
326
17.7949702790921
327
15.8473691038883
328
23.1247471536745
329
19.1440403614285
330
20.1415921828744
331
12.4461924174347
332
17.4054500440514
333
9.42503547248425
334
2.23316186453625
335
12.8167116654003
336
13.5482496677939
337
16.0088774940271
338
18.792522100538
339
16.645410561045
340
11.9521667534805
341
11.7716573762665
342
17.6524628760284
343
18.9350295036017
344
17.3294460957507
345
16.2083878583163
346
17.9849801498437
347
17.7094658372539
348
18.1464885399826
349
18.6500146974743
350
16.7784174705711
351
17.3294460957507
352
16.4934026644437
353
18.4600048267227
354
19.1345398678909
355
20.5881153791406
356
18.9540304906769
357
20.6356178468285
358
21.2626504203088
359
24.7778330292134
360
21.9941884227025
361
21.1296435107827
362
18.2604944624336
363
14.2987886572627
364
17.3294460957507
365
20.5311124179152
366
19.0775369066654
367
22.3267056965178
368
20.9111321594184
369
23.4762654145649
370
17.3199456022131
371
11.6576514538155
372
16.8069189511838
373
10.8881114772716
374
17.4244510311265
375
22.0986938516158
376
24.3503108200223
377
27.2004588812963
378
27.8939949095396
379
24.6543266132248
380
21.8801825002515
381
24.5023187166236
382
20.3221015600884
383
23.6757757788541
384
17.3959495505138
385
11.7811578698041
386
6.35637605984594
387
17.3864490569762
388
21.8706820067139
389
23.1437481407496
390
21.642670161812
391
17.8329722532425
392
14.4697975409392
393
21.1581449913954
394
22.2792032288299
395
20.2080956376374
396
20.9396336400311
397
25.3668636285433
398
25.9273927472605
399
29.195562524188
400
28.3975210670313
401
27.0674519717701
In [23]:
residuals(lm.fit)
1
-5.82259509766824
2
-4.27038978603508
3
3.97485801626161
4
1.63930422066539
5
6.70992217614699
6
-0.904083746303968
7
0.155272587828725
8
10.7396042450824
9
10.3811362785936
10
0.592003069878544
11
-0.125331595032187
12
-3.04668595501456
13
2.07143446815494
14
-6.3064332173421
15
-6.60633450982611
16
-6.60692285305292
17
-5.20251613165552
18
-3.11661685975337
19
-3.24776393395219
20
-5.63728416899296
21
-0.983803463390132
22
-1.81465831691008
23
-1.56891697703351
24
-1.16685972667424
25
-3.46803641312785
26
-4.96852604883867
27
-3.88360995022725
28
-3.33698804644502
29
-3.99320915128082
30
-2.17224962136237
31
-0.382725484452505
32
-7.6651973063789
33
4.97202671325083
34
-4.02043523792397
35
-1.72983702394557
36
-6.45736313500575
37
-3.71377775300442
38
-5.22190804692552
39
-0.229840925814653
40
0.350372328851421
41
2.22725684105772
42
-3.35560200719442
43
-3.73405413404917
44
-2.78547368742365
45
-4.28086955099429
46
-5.55383697751401
47
-1.11064252370753
48
-0.0929130287328689
49
9.1171797099006
50
0.23695865149635
51
-2.07567707133812
52
-5.09487547344525
53
-4.53758029154091
54
-3.14492482720324
55
-1.59311044376484
56
5.41589651219283
57
-4.37205610819949
58
0.798854067960964
59
-4.73650231260328
60
-6.19438583773442
61
-3.36069187746552
62
-4.83512821111771
63
-5.96000872859182
64
-0.528372018682187
65
6.09405641836873
66
-6.61711039733328
67
-5.42533549690128
68
-4.85844111394337
69
-4.717694838691
70
-5.30290705985036
71
-3.9695092221294
72
-3.46735326425415
73
-6.50956844663899
74
-3.99046875204785
75
-4.01250626090392
76
-4.66039965678666
77
-3.18175011489995
78
-3.99683401628853
79
-1.63023185400949
80
-5.60839176018538
81
-1.52807979800333
82
-3.79448454525041
83
-3.3695092221294
84
-4.51897023266059
85
-1.51436609623123
86
-1.75001859934342
87
0.163793809944662
88
-4.33542433366566
89
-5.72856943371415
90
-0.438559562962553
91
-3.5839060727752
92
-4.76343617856757
93
-3.9014381527179
94
-3.65403439254598
95
-3.89281822308598
96
0.163987323107539
97
-2.38028120776749
98
8.14586689993804
99
12.6378353135329
100
4.52696462037887
101
1.89562403301717
102
-0.76696233605931
103
-5.85481624893565
104
-2.4851775648757
105
-2.73973234754707
106
0.593471977011009
107
2.67408006174101
108
-0.76764548493301
109
-3.09673530877255
110
-0.380573428446343
111
-0.50319928052922
112
-2.10133944520191
113
-0.353540854966067
114
0.382502576340963
115
-6.1258251326121
116
-1.28106306415716
117
-1.91524666013689
118
-5.56833253567579
119
0.448417687877217
120
-2.32366917473684
121
1.09836833411923
122
-0.69663660125657
123
2.98054403349768
124
6.88691319960745
125
0.948026759682378
126
0.916390049772743
127
7.04450450405973
128
-2.02249248828324
129
-1.93258132504762
130
-2.82993573146155
131
-3.38321902203242
132
-3.30623580231013
133
-0.989292065594243
134
-1.8745990924005
135
-2.50848656583228
136
-0.341003839647572
137
-1.09800680087306
138
-3.59262080805401
139
-0.998788657262733
140
0.784070190989413
141
2.39935150740996
142
12.5373573834294
143
4.32648278840622
144
6.14646304690302
145
5.07310469218845
146
5.65753115508905
147
-3.13551913931255
148
8.10111653709037
149
10.1515568190432
150
1.2247177587258
151
0.341855008604571
152
-2.33718546147698
153
-7.73924271183625
154
-0.152561583544427
155
-3.18909465056228
156
-4.68409958593808
157
-6.12004430972913
158
11.1068856543661
159
-4.14502353471922
160
-4.23297615511155
161
-2.32856943371415
162
17.0897445026182
163
17.2702538798322
164
18.6003229750934
165
-0.795266401640087
166
-0.23385671901721
167
18.9613417295215
168
0.77975827523891
169
-0.208293052669401
170
-1.49928219484264
171
-3.44462870465529
172
-4.02474715367447
173
2.50238412732179
174
-2.36539472141086
175
-2.79536510915607
176
-0.0900778238530104
177
-1.74884191288981
178
-3.97803044424534
179
1.9205006486222
180
7.43440786355718
181
12.4285322350273
182
10.6241255136299
183
7.92539700573042
184
3.34243944996229
185
5.12784908615361
186
7.53930812253448
187
19.67387874484
188
3.79248880372028
189
-0.421615826246661
190
5.46692513737247
191
7.29141082478265
192
0.401890589741878
193
4.57280076590233
194
1.3249073700196
195
-1.2926247099231
196
18.2678057012781
197
2.6223604839495
198
3.9260840564732
199
6.3354858424948
200
4.67838417375334
201
⋯
202
4.9929784394311
203
0.800030754414571
204
-7.44061681332182
205
-4.78184882241593
206
-6.44521704788209
207
-6.77254574391031
208
-4.01926245333945
209
-5.44845098469497
210
-1.93738287650895
211
-7.42827331116621
212
0.660563775000873
213
0.389945819519275
214
-1.61132957445031
215
-1.45971260604388
216
-3.91348553232557
217
-4.9270018190657
218
-6.83846085544657
219
-4.90026146626429
220
-3.7395388343842
221
-5.1275901622925
222
-5.71103735377146
223
-0.202709644818399
224
-5.78184882241593
225
-4.98047862279945
226
-6.11789225372296
227
-5.64472741217127
228
-7.71495443945803
229
-6.95756055003771
230
-7.44100774151666
231
-5.84394555578159
232
-5.74335721255479
233
-6.02131970369872
234
-5.8689208789026
235
-6.30036017378027
236
-7.02788238297137
237
3.36193007274827
238
-9.83591396937648
239
-3.83248651940073
240
1.02588664144124
241
-7.04982118431146
242
-5.31671556726936
243
-5.41152698948228
244
-4.36304525037273
245
-2.35805018574853
246
-5.97254574391032
247
-5.23806992725173
248
-8.55295641360835
249
-0.17861878747214
250
-8.70594358163127
251
-8.66206597895109
252
-0.0329722532424578
253
-0.246685955014563
254
-0.947274298241366
255
0.0832844327306455
256
-2.15295641360835
257
-1.17264054955721
258
-4.07283796458917
259
-3.84511834036611
260
-7.62807979800333
261
-0.289489480626205
262
0.646850073228771
263
1.21031700621092
264
18.5433200138679
265
18.9898432101342
266
18.2583052077405
267
24.5001294619305
268
23.8825973819879
269
12.2793751507822
270
15.3195330828078
271
-6.7851775648757
272
1.42530610195261
273
-1.07479260556337
274
1.05282831114371
275
-3.66176595453406
276
-7.80349150120808
277
-3.62680050216465
278
-0.832676130694513
279
1.07937124891316
280
3.34617082622416
281
1.9171797099006
282
2.81355484489288
283
3.23823794733503
284
4.73667033268658
285
-3.24531185352899
286
-3.19849643658388
287
6.46908499711681
288
-0.456573474877894
289
-6.34159218287438
290
-6.32053394543995
291
-5.18899594304629
292
-3.65138489709082
293
-7.12885775252392
294
-0.49183114792616
295
0.219138252743885
296
-3.52101967928168
297
-8.04883801102073
298
-3.15833850455831
299
-7.47136515558762
300
-0.0414895734893102
301
-7.72170673002447
302
-0.479688962671593
303
4.87025778170133
304
7.72746205982786
305
11.73813533795
306
-9.94884191288981
307
2.80620640736146
308
15.999355409279
309
0.823150144077357
310
7.5789842225874
311
0.245092847286537
312
-2.55206804596452
313
1.15547390472978
314
-6.16382320489333
315
-4.54971857492639
316
-3.58409958593808
317
-5.43806602538264
318
-0.358144991395428
319
0.972808569640508
320
-6.55099396889598
321
-3.0821371412257
322
-9.44756651892022
323
-9.85912426281707
324
-3.10877878651114
325
-2.17665244089068
326
-3.29497027909214
327
-1.74736910388826
328
-7.02474715367447
329
-4.84404036142849
330
-8.44159218287438
331
0.953807582565348
332
-7.80545004405136
333
-0.725035472484243
334
6.16683813546375
335
-0.0167116654002687
336
-3.04824966779392
337
1.09112250597288
338
-0.392522100538032
339
-1.24541056104497
340
-1.1521667534805
341
0.0283426237335231
342
-2.75246287602844
343
-6.33502950360173
344
-3.22944609575073
345
-3.2083878583163
346
-4.58498014984374
347
-2.50946583725392
348
-2.0464885399826
349
-0.850014697474333
350
-1.87841747057109
351
-3.22944609575073
352
-3.79340266444369
353
-4.96000482672273
354
-4.23453986789091
355
-0.588115379140635
356
-2.55403049067689
357
-2.93561784682853
358
-1.76265042030881
359
-4.57783302921338
360
-0.594188422702462
361
-1.22964351078269
362
0.739505537566442
363
4.80121134273727
364
1.77055390424928
365
-0.431112417915154
366
0.822463093334569
367
-2.72670569651776
368
2.28886784058165
369
6.32373458543508
370
-3.51994560221314
371
1.64234854618448
372
-0.106918951183831
373
1.11188852272845
374
-2.82445103112652
375
-0.698693851615842
376
-1.35031082002228
377
-3.50045888129625
378
-2.89399490953958
379
-2.85432661322484
380
-1.2801825002515
381
-3.30231871662356
382
-1.2221015600884
383
-3.0757757788541
384
-2.19594955051379
385
-4.78115786980406
386
1.74362394015407
387
-3.7864490569762
388
-1.77068200671392
389
-1.34374814074963
390
2.857329838188
391
5.26702774675754
392
5.23020245906083
393
-2.85814499139543
394
-1.07920322882986
395
-2.70809563763744
396
-4.13963364003109
397
-2.96686362854333
398
-5.32739274726054
399
-5.29556252418803
400
-6.39752106703132
401
-15.1674519717701
In [24]:
rstudent(lm.fit)
1
-0.938639094864759
2
-0.687511237980931
3
0.640666347497302
4
0.264236181293794
5
1.08188171555607
6
-0.145609733121283
7
0.024980406942384
8
1.73437879342761
9
1.68464086525454
10
0.0952794075069099
11
-0.0201873355661056
12
-0.490273791133087
13
0.333351306075818
14
-1.0160056473743
15
-1.06414618772446
16
-1.0644855697941
17
-0.838170499781684
18
-0.501568661733041
19
-0.522653922923515
20
-0.907706171314735
21
-0.158494972972672
22
-0.291976161437516
23
-0.252605464147521
24
-0.187922873024383
25
-0.558257512514491
26
-0.800080140062439
27
-0.625097216836159
28
-0.537234871671181
29
-0.642694395922744
30
-0.349518620546228
31
-0.0616922494431119
32
-1.23505454754158
33
0.803970819135612
34
-0.647488997056111
35
-0.278639407928494
36
-1.04016085356967
37
-0.597705381713567
38
-0.840941446675589
39
-0.036981606345384
40
0.0564446368737533
41
0.359165483675643
42
-0.540650973006434
43
-0.601501580145319
44
-0.448455773129773
45
-0.689163599538104
46
-0.894319282545907
47
-0.178694441401907
48
-0.0149589191291224
49
1.4794858705669
50
0.0381314696551221
51
-0.333977592207178
52
-0.820381277981058
53
-0.731170741203512
54
-0.506262368097963
55
-0.256340498952637
56
0.873018618280602
57
-0.704375891600158
58
0.1287121420649
59
-0.762951011976889
60
-0.997771831970615
61
-0.540830339214282
62
-0.778395675895519
63
-0.960384952830883
64
-0.0850219324765685
65
0.981759898912627
66
-1.06709365156242
67
-0.873592790591082
68
-0.782420661418683
69
-0.759424000566502
70
-0.854001800681364
71
-0.639314908989401
72
-0.558086499234794
73
-1.04944931507711
74
-0.642579871579307
75
-0.646236651447894
76
-0.750389335714497
77
-0.512020308202478
78
-0.643349155389836
79
-0.262291326007357
80
-0.903235442588003
81
-0.246113107598036
82
-0.611036956159752
83
-0.542619457064744
84
-0.727772494046973
85
-0.243690320960155
86
-0.281772080751924
87
0.0263513078420176
88
-0.698065093811022
89
-0.923316630282838
90
-0.0706225850927437
91
-0.57693795832973
92
-0.767089184178313
93
-0.62815910651346
94
-0.588541594911102
95
-0.626575681977055
96
0.0264009337714837
97
-0.383010062219545
98
1.31458665056151
99
2.04490270889228
100
0.729279176130483
101
0.305059628056721
102
-0.123450978871037
103
-0.942834136459934
104
-0.399885545990373
105
-0.440856181349915
106
0.0955061434621353
107
0.430589599898271
108
-0.123506199698595
109
-0.498329426694678
110
-0.0612371578224353
111
-0.0809557845134348
112
-0.338144334982185
113
-0.0568920983768528
114
0.0615611068291001
115
-0.986573434264438
116
-0.206145782948235
117
-0.308157617977023
118
-0.896649645697288
119
0.0721525005074296
120
-0.373892198671318
121
0.176721919963297
122
-0.112082505388073
123
0.479881886471635
124
1.11286365810549
125
0.15259484627943
126
0.147446031766052
127
1.13952674395143
128
-0.325544731183282
129
-0.310990306732437
130
-0.45566286253898
131
-0.544455135692867
132
-0.532060886869863
133
-0.159169315590762
134
-0.301647303771584
135
-0.403802958058385
136
-0.0548809192591364
137
-0.176715636642752
138
-0.578217717691548
139
-0.160925319075748
140
0.12622661479639
141
0.387065239338443
142
2.04429419156623
143
0.69912086424294
144
0.993487054197097
145
0.821145206970403
146
0.915038277113709
147
-0.504729222338138
148
1.31283945757636
149
1.64545552575292
150
0.197338427163905
151
0.0550002888803532
152
-0.376063834265179
153
-1.24702508871445
154
-0.0245489336227306
155
-0.51325898051222
156
-0.754089401399369
157
-0.9857810285316
158
1.79486848319044
159
-0.66765302450966
160
-0.681686237418824
161
-0.375047834366134
162
2.776857474875
163
2.80641553490693
164
3.02465385544699
165
-0.127947675768017
166
-0.0376290101600851
167
3.08403360632422
168
0.125450685410765
169
-0.0335119482952013
170
-0.241227905385803
171
-0.554377710193746
172
-0.647779179584251
173
0.402682775792712
174
-0.380697943272136
175
-0.449890315756566
176
-0.0145069316695349
177
-0.281412518706953
178
-0.640754481247913
179
0.309198678276889
180
1.19911407015378
181
2.00852742128645
182
1.71454660209093
183
1.2786412968589
184
0.538398549975597
185
0.825559098495008
186
1.21471296328102
187
3.20142553205642
188
0.610788667777058
189
-0.0679167677303175
190
0.881106085020471
191
1.17596476503487
192
0.0647366288625731
193
0.737447464660288
194
0.213403102901679
195
-0.208244836401842
196
2.97001811452456
197
0.422566777182913
198
0.632083106135286
199
1.02103653551995
200
0.754048566254773
201
⋯
202
0.804388978350195
203
0.128777422611268
204
-1.20030295890086
205
-0.769868267606131
206
-1.03802122837741
207
-1.09180864406126
208
-0.646901763556554
209
-0.877607396632731
210
-0.31178670729707
211
-1.19679729115933
212
0.106339868681343
213
0.0627482035185011
214
-0.259275723491083
215
-0.234852531718946
216
-0.63021760908267
217
-0.793672372587439
218
-1.10203002226353
219
-0.788857951536903
220
-0.602337010663921
221
-0.826412851534179
222
-0.920326588292935
223
-0.0326121219947374
224
-0.931119771731159
225
-0.802216119638659
226
-0.985444032272186
227
-0.908874127026561
228
-1.2436572447646
229
-1.12179839961244
230
-1.19963958253274
231
-0.941401012325245
232
-0.924927401325345
233
-0.969702005023086
234
-0.945350419750092
235
-1.01481118715438
236
-1.13234017445797
237
0.541568283829525
238
-1.58685978034957
239
-0.617166384035214
240
0.165257981314676
241
-1.13573328346223
242
-0.85597984281965
243
-0.871940156041158
244
-0.702881872669362
245
-0.379757278916723
246
-0.962586354237621
247
-0.844144245158181
248
-1.37924008883763
249
-0.0287735776198971
250
-1.40393891153959
251
-1.39762753256173
252
-0.0053071275653646
253
-0.0396873725061245
254
-0.152405779234805
255
0.0133988627945913
256
-0.346570037550887
257
-0.188670613248351
258
-0.65559841350995
259
-0.618888461862466
260
-1.23035463813468
261
-0.0466011426802584
262
0.104070505480749
263
0.19472575844683
264
3.01528395936497
265
3.08872424957903
266
2.96845786558646
267
4.00470254715351
268
3.90101959002739
269
2.00252183767457
270
2.51153744826928
271
-1.0929128169075
272
0.229817986413809
273
-0.173167152969462
274
0.16978747888872
275
-0.590268911700114
276
-1.25791280754907
277
-0.584488453546384
278
-0.134277350657285
279
0.174136220392089
280
0.541909292011534
281
0.310464422627529
282
0.454906151622062
283
0.524945339812025
284
0.76731884233673
285
-0.522934029405484
286
-0.514911063302858
287
1.04263188424416
288
-0.073697979200531
289
-1.02142339440823
290
-1.01817018246058
291
-0.835713290468071
292
-0.588155892888511
293
-1.14958310297145
294
-0.0796266096622211
295
0.0354626565259144
296
-0.568856133454207
297
-1.29855744526259
298
-0.50882724127053
299
-1.20491999711317
300
-0.00670321574465496
301
-1.24678127191027
302
-0.0773452117295426
303
0.784013399406782
304
1.24973132802751
305
1.89705886958375
306
-1.60487152215782
307
0.452202403995784
308
2.61554152680479
309
0.132573648052529
310
1.23546787820642
311
0.0396386501267494
312
-0.412033536057047
313
0.186611501515953
314
-0.993841470310478
315
-0.733807662066663
316
-0.576866139413773
317
-0.875704860059964
318
-0.0576211647379782
319
0.156855620546657
320
-1.05551241227763
321
-0.497312911081858
322
-1.52370853505663
323
-1.59023312199085
324
-0.501034325983144
325
-0.351117397176642
326
-0.530502227107625
327
-0.281411371107686
328
-1.13159260478024
329
-0.779976965468237
330
-1.36075464001696
331
0.153790479091726
332
-1.25843504117571
333
-0.117080203009323
334
1.00205503645968
335
-0.00269407220158684
336
-0.491377992269896
337
0.175707232492936
338
-0.0631685425433482
339
-0.20052042870578
340
-0.185816694023616
341
0.00457120589849792
342
-0.443132208090307
343
-1.02052349759131
344
-0.519998816277143
345
-0.516750459377634
346
-0.738363817634674
347
-0.403992816743916
348
-0.329412713277216
349
-0.136797987662696
350
-0.302444583880949
351
-0.519998816277143
352
-0.610992338801498
353
-0.798760747322518
354
-0.681740240956423
355
-0.0946249869794545
356
-0.411076797985518
357
-0.472425760429892
358
-0.283609020401765
359
-0.736962684664491
360
-0.0955949332581354
361
-0.197842038256283
362
0.119020699139877
363
0.77401415024766
364
0.28503739418057
365
-0.0693640027553692
366
0.132354829597394
367
-0.438758741606478
368
0.368305255166795
369
1.01843496425756
370
-0.566804306824535
371
0.264916833516439
372
-0.0172133721367554
373
0.179412287645274
374
-0.454748986254468
375
-0.112408245269015
376
-0.217264652947644
377
-0.563598889403136
378
-0.46597725787251
379
-0.459347016517151
380
-0.205967270780368
381
-0.531473032865701
382
-0.196640862747838
383
-0.494967482280201
384
-0.35353098051867
385
-0.771576050351409
386
0.28213345512461
387
-0.609739941079116
388
-0.28489428822031
389
-0.216194964942242
390
0.459793755622783
391
0.84837328753545
392
0.843215665546847
393
-0.459935866439256
394
-0.173628446760455
395
-0.435813419050483
396
-0.666317591168703
397
-0.4775020260264
398
-0.857914768513885
399
-0.853387417647027
400
-1.03108965348273
401
-2.45582479352552
In [25]:
plot(predict(lm.fit), residuals(lm.fit))
plot(predict(lm.fit), rstudent(lm.fit))
No description has been provided for this image
No description has been provided for this image

根据残差图,有一些非线性的证据。可以使用hatvalues()函数为任意数量的预测变量计算统计信息

In [27]:
plot(hatvalues(lm.fit))
which.max(hatvalues(lm.fit))
375: 375
No description has been provided for this image

Multiple Linear Regression 多元线性回归

为了使用最小二乘法拟合多元线性回归模型,我们再次使用lm()函数。语法lm(y ~ x1 + x2 + x3)用于拟合一个具有三个预测因子x1、x2和x3的模型。summary()函数现在输出所有预测变量的回归系数。

In [2]:
lm.fit <- lm(medv ~ lstat + age, data = Boston)
summary(lm.fit)
Call:
lm(formula = medv ~ lstat + age, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.981  -3.978  -1.283   1.968  23.158 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 33.22276    0.73085  45.458  < 2e-16 ***
lstat       -1.03207    0.04819 -21.416  < 2e-16 ***
age          0.03454    0.01223   2.826  0.00491 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.173 on 503 degrees of freedom
Multiple R-squared:  0.5513,	Adjusted R-squared:  0.5495 
F-statistic:   309 on 2 and 503 DF,  p-value: < 2.2e-16
In [3]:
# 对所有的12个变量进行的多元线性回归

lm.fit <- lm(medv ~ ., data = Boston)
summary(lm.fit)
Call:
lm(formula = medv ~ ., data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.1304  -2.7673  -0.5814   1.9414  26.2526 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  41.617270   4.936039   8.431 3.79e-16 ***
crim         -0.121389   0.033000  -3.678 0.000261 ***
zn            0.046963   0.013879   3.384 0.000772 ***
indus         0.013468   0.062145   0.217 0.828520    
chas          2.839993   0.870007   3.264 0.001173 ** 
nox         -18.758022   3.851355  -4.870 1.50e-06 ***
rm            3.658119   0.420246   8.705  < 2e-16 ***
age           0.003611   0.013329   0.271 0.786595    
dis          -1.490754   0.201623  -7.394 6.17e-13 ***
rad           0.289405   0.066908   4.325 1.84e-05 ***
tax          -0.012682   0.003801  -3.337 0.000912 ***
ptratio      -0.937533   0.132206  -7.091 4.63e-12 ***
lstat        -0.552019   0.050659 -10.897  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.798 on 493 degrees of freedom
Multiple R-squared:  0.7343,	Adjusted R-squared:  0.7278 
F-statistic: 113.5 on 12 and 493 DF,  p-value: < 2.2e-16
In [7]:
?summary.lm  # 查看summary对象中的参数
summary.lm                package:stats                R Documentation

_S_u_m_m_a_r_i_z_i_n_g _L_i_n_e_a_r _M_o_d_e_l _F_i_t_s

_D_e_s_c_r_i_p_t_i_o_n:

     ‘summary’ method for class ‘"lm"’.

_U_s_a_g_e:

     ## S3 method for class 'lm'
     summary(object, correlation = FALSE, symbolic.cor = FALSE, ...)
     
     ## S3 method for class 'summary.lm'
     print(x, digits = max(3, getOption("digits") - 3),
           symbolic.cor = x$symbolic.cor,
           signif.stars = getOption("show.signif.stars"), ...)
     
_A_r_g_u_m_e_n_t_s:

  object: an object of class ‘"lm"’, usually, a result of a call to
          ‘lm’.

       x: an object of class ‘"summary.lm"’, usually, a result of a
          call to ‘summary.lm’.

correlation: logical; if ‘TRUE’, the correlation matrix of the
          estimated parameters is returned and printed.

  digits: the number of significant digits to use when printing.

symbolic.cor: logical. If ‘TRUE’, print the correlations in a symbolic
          form (see ‘symnum’) rather than as numbers.

signif.stars: logical. If ‘TRUE’, ‘significance stars’ are printed for
          each coefficient.

     ...: further arguments passed to or from other methods.

_D_e_t_a_i_l_s:

     ‘print.summary.lm’ tries to be smart about formatting the
     coefficients, standard errors, etc. and additionally gives
     ‘significance stars’ if ‘signif.stars’ is ‘TRUE’.

     Aliased coefficients are omitted in the returned object but
     restored by the ‘print’ method.

     Correlations are printed to two decimal places (or symbolically):
     to see the actual correlations print ‘summary(object)$correlation’
     directly.

_V_a_l_u_e:

     The function ‘summary.lm’ computes and returns a list of summary
     statistics of the fitted linear model given in ‘object’, using the
     components (list elements) ‘"call"’ and ‘"terms"’ from its
     argument, plus

residuals: the _weighted_ residuals, the usual residuals rescaled by
          the square root of the weights specified in the call to ‘lm’.

coefficients: a p x 4 matrix with columns for the estimated
          coefficient, its standard error, t-statistic and
          corresponding (two-sided) p-value.  Aliased coefficients are
          omitted.

 aliased: named logical vector showing if the original coefficients are
          aliased.

   sigma: the square root of the estimated variance of the random error

                       sigma^2 = 1/(n-p) Sum(w[i] R[i]^2),              
          
          where R[i] is the i-th residual, ‘residuals[i]’.

      df: degrees of freedom, a 3-vector (p, n-p, p*), the first being
          the number of non-aliased coefficients, the last being the
          total number of coefficients.

fstatistic: (for models including non-intercept terms) a 3-vector with
          the value of the F-statistic with its numerator and
          denominator degrees of freedom.

r.squared: R^2, the ‘fraction of variance explained by the model’,

                    R^2 = 1 - Sum(R[i]^2) / Sum((y[i]- y*)^2),          
          
          where y* is the mean of y[i] if there is an intercept and
          zero otherwise.

adj.r.squared: the above R^2 statistic ‘_adjusted_’, penalizing for
          higher p.

cov.unscaled: a p x p matrix of (unscaled) covariances of the coef[j],
          j=1, ..., p.

correlation: the correlation matrix corresponding to the above
          ‘cov.unscaled’, if ‘correlation = TRUE’ is specified.

symbolic.cor: (only if ‘correlation’ is true.)  The value of the
          argument ‘symbolic.cor’.

na.action: from ‘object’, if present there.

_S_e_e _A_l_s_o:

     The model fitting function ‘lm’, ‘summary’.

     Function ‘coef’ will extract the matrix of coefficients with
     standard errors, t-statistics and p-values.

_E_x_a_m_p_l_e_s:

     ##-- Continuing the  lm(.) example:
     coef(lm.D90)  # the bare coefficients
     sld90 <- summary(lm.D90 <- lm(weight ~ group -1))  # omitting intercept
     sld90
     coef(sld90)  # much more
     
     ## model with *aliased* coefficient:
     lm.D9. <- lm(weight ~ group + I(group != "Ctl"))
     Sm.D9. <- summary(lm.D9.)
     Sm.D9. #  shows the NA NA NA NA  line
     stopifnot(length(cc <- coef(lm.D9.)) == 3, is.na(cc[3]),
               dim(coef(Sm.D9.)) == c(2,4), Sm.D9.$df == c(2, 18, 3))
     
In [9]:
summary(object = lm.fit)
Call:
lm(formula = medv ~ ., data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.1304  -2.7673  -0.5814   1.9414  26.2526 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  41.617270   4.936039   8.431 3.79e-16 ***
crim         -0.121389   0.033000  -3.678 0.000261 ***
zn            0.046963   0.013879   3.384 0.000772 ***
indus         0.013468   0.062145   0.217 0.828520    
chas          2.839993   0.870007   3.264 0.001173 ** 
nox         -18.758022   3.851355  -4.870 1.50e-06 ***
rm            3.658119   0.420246   8.705  < 2e-16 ***
age           0.003611   0.013329   0.271 0.786595    
dis          -1.490754   0.201623  -7.394 6.17e-13 ***
rad           0.289405   0.066908   4.325 1.84e-05 ***
tax          -0.012682   0.003801  -3.337 0.000912 ***
ptratio      -0.937533   0.132206  -7.091 4.63e-12 ***
lstat        -0.552019   0.050659 -10.897  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.798 on 493 degrees of freedom
Multiple R-squared:  0.7343,	Adjusted R-squared:  0.7278 
F-statistic: 113.5 on 12 and 493 DF,  p-value: < 2.2e-16
In [ ]:
summary(lm.fit)$r.sq # 模型的R方,未矫正前的 
summary(lm.fit)$sigma # 模型的RSE,残差标准误差                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
0.734307043761308
4.79803433559637
In [ ]:
# car包中的vif函数,可以用于计算方差膨胀因子,variance infation factors
# install.packages("car")
library(car)
vif(lm.fit)

# 使用vif函数取不同自变量的方差膨胀因子,并进行开根号。如果大于2,代表自变量与其他变量存在共线性,此时需要对模型进行优化
Updating HTML index of packages in '.Library'

Making 'packages.html' ...
 done

Loading required package: carData

crim
1.76748591543102
zn
2.29845890773582
indus
3.9871806307571
chas
1.07116777375841
nox
4.3690926228448
rm
1.91253243743689
age
3.0882320397312
dis
3.95403664162831
rad
7.44530076006986
tax
9.00215766347181
ptratio
1.79705959312978
lstat
2.87077650084175
In [12]:
print(vif(lm.fit))
    crim       zn    indus     chas      nox       rm      age      dis 
1.767486 2.298459 3.987181 1.071168 4.369093 1.912532 3.088232 3.954037 
     rad      tax  ptratio    lstat 
7.445301 9.002158 1.797060 2.870777 

如果我们想使用除一个变量之外的所有变量执行回归呢?例如,在上述回归输出中,年龄具有较高的p值。因此,我们可能希望运行一个回归排除这个预测因素。下面的语法将使用除年龄之外的所有预测因子进行回归。

In [18]:
lm.fit1 <- lm(medv ~ . - age, data = Boston)
summary(lm.fit1)
Call:
lm(formula = medv ~ . - age, data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.1851  -2.7330  -0.6116   1.8555  26.3838 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  41.525128   4.919684   8.441 3.52e-16 ***
crim         -0.121426   0.032969  -3.683 0.000256 ***
zn            0.046512   0.013766   3.379 0.000785 ***
indus         0.013451   0.062086   0.217 0.828577    
chas          2.852773   0.867912   3.287 0.001085 ** 
nox         -18.485070   3.713714  -4.978 8.91e-07 ***
rm            3.681070   0.411230   8.951  < 2e-16 ***
dis          -1.506777   0.192570  -7.825 3.12e-14 ***
rad           0.287940   0.066627   4.322 1.87e-05 ***
tax          -0.012653   0.003796  -3.333 0.000923 ***
ptratio      -0.934649   0.131653  -7.099 4.39e-12 ***
lstat        -0.547409   0.047669 -11.483  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.794 on 494 degrees of freedom
Multiple R-squared:  0.7343,	Adjusted R-squared:  0.7284 
F-statistic: 124.1 on 11 and 494 DF,  p-value: < 2.2e-16
In [17]:
lm.fit2 <- lm(medv ~ . - age - indus, data = Boston)
summary(lm.fit2)
Call:
lm(formula = medv ~ . - age - indus, data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.1814  -2.7625  -0.6243   1.8448  26.3920 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  41.451747   4.903283   8.454 3.18e-16 ***
crim         -0.121665   0.032919  -3.696 0.000244 ***
zn            0.046191   0.013673   3.378 0.000787 ***
chas          2.871873   0.862591   3.329 0.000935 ***
nox         -18.262427   3.565247  -5.122 4.33e-07 ***
rm            3.672957   0.409127   8.978  < 2e-16 ***
dis          -1.515951   0.187675  -8.078 5.08e-15 ***
rad           0.283932   0.063945   4.440 1.11e-05 ***
tax          -0.012292   0.003407  -3.608 0.000340 ***
ptratio      -0.930961   0.130423  -7.138 3.39e-12 ***
lstat        -0.546509   0.047442 -11.519  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.789 on 495 degrees of freedom
Multiple R-squared:  0.7342,	Adjusted R-squared:  0.7289 
F-statistic: 136.8 on 10 and 495 DF,  p-value: < 2.2e-16
In [ ]:
# 或者,也可以使用update函数 # update and (by default) re-fit a model
lm.fit3 <- update(lm.fit, ~ . - age)
summary(lm.fit3)
Call:
lm(formula = medv ~ crim + zn + indus + chas + nox + rm + dis + 
    rad + tax + ptratio + lstat, data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.1851  -2.7330  -0.6116   1.8555  26.3838 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  41.525128   4.919684   8.441 3.52e-16 ***
crim         -0.121426   0.032969  -3.683 0.000256 ***
zn            0.046512   0.013766   3.379 0.000785 ***
indus         0.013451   0.062086   0.217 0.828577    
chas          2.852773   0.867912   3.287 0.001085 ** 
nox         -18.485070   3.713714  -4.978 8.91e-07 ***
rm            3.681070   0.411230   8.951  < 2e-16 ***
dis          -1.506777   0.192570  -7.825 3.12e-14 ***
rad           0.287940   0.066627   4.322 1.87e-05 ***
tax          -0.012653   0.003796  -3.333 0.000923 ***
ptratio      -0.934649   0.131653  -7.099 4.39e-12 ***
lstat        -0.547409   0.047669 -11.483  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.794 on 494 degrees of freedom
Multiple R-squared:  0.7343,	Adjusted R-squared:  0.7284 
F-statistic: 124.1 on 11 and 494 DF,  p-value: < 2.2e-16

使用lm()函数可以很容易地在线性模型中包括交互项。

语法lstat:age告诉R在lstat和age之间包含一个交互项。

语法lstat * age同时包括lstat、age和交互项lstat×age作为预测符;它是lstat + age + lstat:age的缩写。

多个变量之间的交互作用:https://zh.wikipedia.org/wiki/%E4%BA%A4%E4%BA%92%E4%BD%9C%E7%94%A8_(%E7%BB%9F%E8%AE%A1%E5%AD%A6)

image.png image-2.png

其中一个原因变量对结果的影响依赖于第二个原因变量的状态(即两者的影响原因不是简单累加的)

age (proportion of owner-occupied units built prior to 1940), and lstat (percent of households with low socioeconomic status).

age(1940年之前建造的自住单元的比例)和lstat(社会经济地位较低的家庭的百分比)

来预测medv房假中位数

In [ ]:
summary(lm(medv ~ lstat * age, data = Boston)) 
Call:
lm(formula = medv ~ lstat * age, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.806  -4.045  -1.333   2.085  27.552 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 36.0885359  1.4698355  24.553  < 2e-16 ***
lstat       -1.3921168  0.1674555  -8.313 8.78e-16 ***
age         -0.0007209  0.0198792  -0.036   0.9711    
lstat:age    0.0041560  0.0018518   2.244   0.0252 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.149 on 502 degrees of freedom
Multiple R-squared:  0.5557,	Adjusted R-squared:  0.5531 
F-statistic: 209.3 on 3 and 502 DF,  p-value: < 2.2e-16

解读参考如下:

image.png image-2.png

Non-linear Transformations of the Predictors 预测变量的非线性转换

lm()函数还可以适应预测变量的非线性转换。

例如,给定预测器变量X,我们可以在公式对象中创建预测变量X2;这样的包装允许I(X^2)的标准用法。

函数I()是必需的,因为^有一个特殊的意义R,即提高X的2次方。现在我们将medv回归到lstat和lstat2。

In [ ]:
lm.fit2 <- lm(medv ~ lstat + I(lstat^2),data = Boston) # 不能写成medv ~ lstat + lstat^2
summary(lm.fit2)
Call:
lm(formula = medv ~ lstat + I(lstat^2), data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-15.2834  -3.8313  -0.5295   2.3095  25.4148 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 42.862007   0.872084   49.15   <2e-16 ***
lstat       -2.332821   0.123803  -18.84   <2e-16 ***
I(lstat^2)   0.043547   0.003745   11.63   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.524 on 503 degrees of freedom
Multiple R-squared:  0.6407,	Adjusted R-squared:  0.6393 
F-statistic: 448.5 on 2 and 503 DF,  p-value: < 2.2e-16

与二次项相关的接近零的p值表明它导致了一个改进的模型。我们使用anova()函数进一步量化二次拟合优于线性拟合的程度

线性拟合:

image.png

二次项拟合(非线性拟合):

image-2.png

In [26]:
lm.fit <- lm(medv ~ lstat,data = Boston)
summary(lm.fit)
anova(lm.fit, lm.fit2)
Call:
lm(formula = medv ~ lstat, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.168  -3.990  -1.318   2.034  24.500 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 34.55384    0.56263   61.41   <2e-16 ***
lstat       -0.95005    0.03873  -24.53   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared:  0.5441,	Adjusted R-squared:  0.5432 
F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16
A anova: 2 × 6
Res.DfRSSDfSum of SqFPr(>F)
<dbl><dbl><dbl><dbl><dbl><dbl>
150419472.38NANANANA
250419472.38 0 0NANA
In [28]:
lm.fit <- lm(medv ~ lstat,data = Boston)
lm.fit2 <- lm(medv ~ lstat + I(lstat^2),data = Boston)
anova(lm.fit, lm.fit2)
A anova: 2 × 6
Res.DfRSSDfSum of SqFPr(>F)
<dbl><dbl><dbl><dbl><dbl><dbl>
150419472.38NA NA NA NA
250315347.24 14125.138135.19987.630116e-28

这里模型1表示线性子模型,只包含一个预测因子lstat,而模型2对应的是较大的二次模型,它有两个预测因子lstat和lstat2。

anova()函数执行一个比较两个模型的假设检验。零假设是两个模型都能很好地拟合数据,备择假设是完整模型更优。这里的f统计量是135,相关的p值几乎为零。

这提供了非常明确的证据,证明包含预测器lstat和lstat2的模型比只包含预测器lstat的模型要好得多。这并不奇怪,因为我们之前看到 medv和lstat之间非线性关系的证据

In [29]:
par(mfrow = c(2, 2))
plot(lm.fit2)
No description has been provided for this image

image.png

然后我们看到,当lstat2项包含在模型中时,在残差中几乎没有可识别的模式。

为了创建一个三次拟合,我们可以包括一个I(X^3)形式的预测器。

然而,对于高阶多项式,这种方法开始变得繁琐。更好的方法是在lm()中使用来创建多项式。

例如,下面的命令poly()函数poly()产生一个五阶多项式拟合:

In [31]:
lm.fit5 <- lm(medv ~ poly(lstat, 5),data = Boston)
summary(lm.fit5)
Call:
lm(formula = medv ~ poly(lstat, 5), data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.5433  -3.1039  -0.7052   2.0844  27.1153 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       22.5328     0.2318  97.197  < 2e-16 ***
poly(lstat, 5)1 -152.4595     5.2148 -29.236  < 2e-16 ***
poly(lstat, 5)2   64.2272     5.2148  12.316  < 2e-16 ***
poly(lstat, 5)3  -27.0511     5.2148  -5.187 3.10e-07 ***
poly(lstat, 5)4   25.4517     5.2148   4.881 1.42e-06 ***
poly(lstat, 5)5  -19.2524     5.2148  -3.692 0.000247 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.215 on 500 degrees of freedom
Multiple R-squared:  0.6817,	Adjusted R-squared:  0.6785 
F-statistic: 214.2 on 5 and 500 DF,  p-value: < 2.2e-16

这表明,包括额外的多项式项,最多五阶,导致模型拟合的改进!然而,数据的进一步调查表明,在回归拟合中,没有超过五阶的多项式项有显著的p值

默认情况下,poly()函数将预测函数正交:这意味着该函数输出的特征不只是参数的幂序列。

然而,应用于poly()函数输出的线性模型将具有与应用于原始多项式的线性模型相同的拟合值(尽管系数估计值、标准误和p值会有所不同)。

为了从poly()函数中获得原始的多项式,必须使用参数raw = TRUE。

image.png image-2.png

In [33]:
# 当然,我们并不局限于使用多项式变换来进行预测。这里我们尝试一个对数变换
summary(lm(medv ~ log(lstat), data = Boston))
summary(lm(medv ~ log(rm), data = Boston))
Call:
lm(formula = medv ~ log(lstat), data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.4599  -3.5006  -0.6686   2.1688  26.0129 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  52.1248     0.9652   54.00   <2e-16 ***
log(lstat)  -12.4810     0.3946  -31.63   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.329 on 504 degrees of freedom
Multiple R-squared:  0.6649,	Adjusted R-squared:  0.6643 
F-statistic:  1000 on 1 and 504 DF,  p-value: < 2.2e-16
Call:
lm(formula = medv ~ log(rm), data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-19.487  -2.875  -0.104   2.837  39.816 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -76.488      5.028  -15.21   <2e-16 ***
log(rm)       54.055      2.739   19.73   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.915 on 504 degrees of freedom
Multiple R-squared:  0.4358,	Adjusted R-squared:  0.4347 
F-statistic: 389.3 on 1 and 504 DF,  p-value: < 2.2e-16
In [34]:
# 生成正交多项式特征
x <- 1:10
poly_features <- poly(x, degree = 3)
print(poly_features)

# 生成原始多项式特征
raw_poly_features <- poly(x, degree = 3, raw = TRUE)
print(raw_poly_features)
                1           2          3
 [1,] -0.49543369  0.52223297 -0.4534252
 [2,] -0.38533732  0.17407766  0.1511417
 [3,] -0.27524094 -0.08703883  0.3778543
 [4,] -0.16514456 -0.26111648  0.3346710
 [5,] -0.05504819 -0.34815531  0.1295501
 [6,]  0.05504819 -0.34815531 -0.1295501
 [7,]  0.16514456 -0.26111648 -0.3346710
 [8,]  0.27524094 -0.08703883 -0.3778543
 [9,]  0.38533732  0.17407766 -0.1511417
[10,]  0.49543369  0.52223297  0.4534252
attr(,"coefs")
attr(,"coefs")$alpha
[1] 5.5 5.5 5.5

attr(,"coefs")$norm2
[1]    1.0   10.0   82.5  528.0 3088.8

attr(,"degree")
[1] 1 2 3
attr(,"class")
[1] "poly"   "matrix"
       1   2    3
 [1,]  1   1    1
 [2,]  2   4    8
 [3,]  3   9   27
 [4,]  4  16   64
 [5,]  5  25  125
 [6,]  6  36  216
 [7,]  7  49  343
 [8,]  8  64  512
 [9,]  9  81  729
[10,] 10 100 1000
attr(,"degree")
[1] 1 2 3
attr(,"class")
[1] "poly"   "matrix"

Qualitative Predictors 定性预测

In [35]:
# 现在使用的是其他的数据,ISLR2库中的Carseats数据集 
# 我们将尝试基于一些预测因素来预测400个地点的销售Sales(儿童汽车座椅的销售)
head(Carseats)
A data.frame: 6 × 11
SalesCompPriceIncomeAdvertisingPopulationPriceShelveLocAgeEducationUrbanUS
<dbl><dbl><dbl><dbl><dbl><dbl><fct><dbl><dbl><fct><fct>
1 9.50138 7311276120Bad 4217YesYes
211.22111 4816260 83Good 6510YesYes
310.06113 3510269 80Medium5912YesYes
4 7.40117100 4466 97Medium5514YesYes
5 4.15141 64 3340128Bad 3813YesNo
610.8112411313501 72Bad 7816No Yes

我们可以看到这些预测变量中,除了定量的变量之外,还包括了定性的变量,比如说是Shelveloc,它是货架位置质量的一个指标,也就是商店中展示汽车座椅的空间;

然后这一个变量的取值总共有3个可能的值:差,中,好。

给定一个定性变量,如Shelveloc, R会自动生成哑变量。下面我们给出一个包含一些交互项的多元回归模型。

In [36]:
lm.fit <- lm(Sales ~ . + Income:Advertising + Price:Age, data = Carseats) # Income:Advertising表示Income和Advertising的交互项,以及原始数据中的定性项:ShelveLoc	
summary(lm.fit)
Call:
lm(formula = Sales ~ . + Income:Advertising + Price:Age, data = Carseats)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.9208 -0.7503  0.0177  0.6754  3.3413 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         6.5755654  1.0087470   6.519 2.22e-10 ***
CompPrice           0.0929371  0.0041183  22.567  < 2e-16 ***
Income              0.0108940  0.0026044   4.183 3.57e-05 ***
Advertising         0.0702462  0.0226091   3.107 0.002030 ** 
Population          0.0001592  0.0003679   0.433 0.665330    
Price              -0.1008064  0.0074399 -13.549  < 2e-16 ***
ShelveLocGood       4.8486762  0.1528378  31.724  < 2e-16 ***
ShelveLocMedium     1.9532620  0.1257682  15.531  < 2e-16 ***
Age                -0.0579466  0.0159506  -3.633 0.000318 ***
Education          -0.0208525  0.0196131  -1.063 0.288361    
UrbanYes            0.1401597  0.1124019   1.247 0.213171    
USYes              -0.1575571  0.1489234  -1.058 0.290729    
Income:Advertising  0.0007510  0.0002784   2.698 0.007290 ** 
Price:Age           0.0001068  0.0001333   0.801 0.423812    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.011 on 386 degrees of freedom
Multiple R-squared:  0.8761,	Adjusted R-squared:  0.8719 
F-statistic:   210 on 13 and 386 DF,  p-value: < 2.2e-16

contrasts()函数的作用是:返回R使用的虚拟变量的编码。

In [37]:
attach(Carseats)
contrasts(ShelveLoc)
A matrix: 3 × 2 of type dbl
GoodMedium
Bad00
Good10
Medium01

R创建了一个ShelveLocGood虚拟变量,如果放置位置良好,该变量的值为1,否则为0。

它还创建了一个ShelveLocMedium虚拟变量,如果搁置位置是中等,则该变量等于1,否则等于0。

一个糟糕的搁置位置对应于两个虚拟变量的零。

事实上,ShelveLocGood在回归输出中的系数是正的,这表明一个好的货架位置与高销量相关(相对于一个不好的位置)。而ShelveLocMedium的正系数较小,表明中等货架位置的销量高于差货架位置,但低于好货架位置

Writing Function 自己手写函数

In [40]:
LoadLibraries <- function(){
    library(ISLR2)
    library(MASS)
    print("the libraries ISLR2 & MASS have been loaded")
}
In [ ]:
LoadLibraries  #该函数的信息
LoadLibraries() #该函数的执行
function () 
{
    library(ISLR2)
    library(MASS)
    print("the libraries ISLR2 & MASS have been loaded")
}
[1] "the libraries ISLR2 & MASS have been loaded"

二,练习部分

Section 3.7 Exercises 3, 4, 9, 13

3,

image.png

$y = \beta_0 + \beta_1 \cdot \text{GPA} + \beta_2 \cdot \text{IQ} + \beta_3 \cdot \text{Level} + \beta_4 \cdot \text{GPA} \cdot \text{IQ} + \beta_5 \cdot \text{GPA} \cdot \text{Level}$

$$ y = \beta_0 + \beta_1 \cdot \text{GPA} + \beta_2 \cdot \text{IQ} + \beta_3 \cdot \text{Level} + \beta_4 \cdot \text{GPA} \cdot \text{IQ} + \beta_5 \cdot \text{GPA} \cdot \text{Level} $$

其中,系数估计值为:

$$ \begin{align*} \hat{\beta}_0 &= 50 \\ \hat{\beta}_1 &= 20 \\ \hat{\beta}_2 &= 0.07 \\ \hat{\beta}_3 &= 35 \\ \hat{\beta}_4 &= 0.01 \\ \hat{\beta}_5 &= -10 \\ \end{align*} $$

假设我们有一个包含五个预测变量的数据集,分别是:X1 = GPA, X2 = IQ, X3 = Level(1 表示 College,0 表示 High School),X4 = GPA 和 IQ 的交互项,X5 = GPA 和 Level 的交互项。响应变量是毕业后的起始薪水(以千美元为单位)。假设我们使用最小二乘法拟合模型,并得到以下系数估计值:

$$ y = 50 + 20 \cdot \text{GPA} + 0.07 \cdot \text{IQ} + 35 \cdot \text{Level} + 0.01 \cdot \text{GPA} \cdot \text{IQ} - 10 \cdot \text{GPA} \cdot \text{Level} $$

a

如果固定IQ和GPA,然后我们只考虑level的话,比如说从高中生变到大学生,也就是level的值从0变到1的话,那么对于salary的影响就是:

$$ y1 = 50 + 20 \cdot \text{GPA} + 0.07 \cdot \text{IQ} + 0.01 \cdot \text{GPA} \cdot \text{IQ} $$

$$ y2 = 50 + 20 \cdot \text{GPA} + 0.07 \cdot \text{IQ} + 35 + 0.01 \cdot \text{GPA} \cdot \text{IQ} - 10 \cdot \text{GPA} $$

$\Delta y = y2 - y1 = 35 - 10 \cdot \text{GPA}$

$\Delta y > 0 \Rightarrow \text{GPA} < \dfrac{35}{10} = 3.5$

所以意思就是如果固定IQ的话,只有GPA低于一定程度,大学生的初始薪水才会比高中生多,也就是只有GPA高于一定程度,高中生的初始薪水水平会比高中生多

In [2]:
?outer
outer                   package:base                   R Documentation

_O_u_t_e_r _P_r_o_d_u_c_t _o_f _A_r_r_a_y_s

_D_e_s_c_r_i_p_t_i_o_n:

     The outer product of the arrays ‘X’ and ‘Y’ is the array ‘A’ with
     dimension ‘c(dim(X), dim(Y))’ where element ‘A[c(arrayindex.x,
     arrayindex.y)] = FUN(X[arrayindex.x], Y[arrayindex.y], ...)’.

_U_s_a_g_e:

     outer(X, Y, FUN = "*", ...)
     X %o% Y
     
_A_r_g_u_m_e_n_t_s:

    X, Y: first and second arguments for function ‘FUN’.  Typically a
          vector or array.

     FUN: a function to use on the outer products, found _via_
          ‘match.fun’ (except for the special case ‘"*"’).

     ...: optional arguments to be passed to ‘FUN’.

_D_e_t_a_i_l_s:

     ‘X’ and ‘Y’ must be suitable arguments for ‘FUN’.  Each will be
     extended by ‘rep’ to length the products of the lengths of ‘X’ and
     ‘Y’ before ‘FUN’ is called.

     ‘FUN’ is called with these two extended vectors as arguments (plus
     any arguments in ‘...’).  It must be a vectorized function (or the
     name of one) expecting at least two arguments and returning a
     value with the same length as the first (and the second).

     Where they exist, the [dim]names of ‘X’ and ‘Y’ will be copied to
     the answer, and a dimension assigned which is the concatenation of
     the dimensions of ‘X’ and ‘Y’ (or lengths if dimensions do not
     exist).

     ‘FUN = "*"’ is handled as a special case _via_ ‘as.vector(X) %*%
     t(as.vector(Y))’, and is intended only for numeric vectors and
     arrays.

     ‘%o%’ is binary operator providing a wrapper for ‘outer(x, y,
     "*")’.

_A_u_t_h_o_r(_s):

     Jonathan Rougier

_R_e_f_e_r_e_n_c_e_s:

     Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S
     Language_.  Wadsworth & Brooks/Cole.

_S_e_e _A_l_s_o:

     ‘%*%’ for usual (_inner_) matrix vector multiplication;
     ‘kronecker’ which is based on ‘outer’; ‘Vectorize’ for vectorizing
     a non-vectorized function.

_E_x_a_m_p_l_e_s:

     x <- 1:9; names(x) <- x
     # Multiplication & Power Tables
     x %o% x
     y <- 2:8; names(y) <- paste(y,":", sep = "")
     outer(y, x, `^`)
     
     outer(month.abb, 1999:2003, FUN = paste)
     
     ## three way multiplication table:
     x %o% x %o% y[1:3]
     
In [3]:
# a
# 固定IQ和GPA的取值,然后考虑预测变量的取值对于响应变量的影响 
 
library(plotly)   #用于构建交互式的图表
salary_model <- function(GPA,IQ,LEVEL){
  50+20*GPA+0.07*IQ+35*LEVEL+0.01*GPA*IQ-10*GPA*LEVEL
}
gpa_test <- seq(1,5,length=10)  # 人工构建从1到5,长度为10的等间隔的IQ值
iq_test <- seq(1,200,length=20) # 人工构建从1到200,长度为20的等间隔的GPA值
In [22]:
print(gpa_test)
print(iq_test)
 [1] 1.000000 1.444444 1.888889 2.333333 2.777778 3.222222 3.666667 4.111111
 [9] 4.555556 5.000000
 [1]   1.00000  23.11111  45.22222  67.33333  89.44444 111.55556 133.66667
 [8] 155.77778 177.88889 200.00000
In [ ]:
outer(gpa_test,iq_test,LEVEL=1,FUN=salary_model)   
t(outer(gpa_test,iq_test,LEVEL=1,FUN=salary_model)) 
A matrix: 10 × 20 of type dbl
95.08000 95.91789 96.75579 97.59368 98.43158 99.26947100.1074100.9453101.7832102.6211103.4589104.2968105.1347105.9726106.8105107.6484108.4863109.3242110.1621111.0000
99.52889100.41333101.29778102.18222103.06667103.95111104.8356105.7200106.6044107.4889108.3733109.2578110.1422111.0267111.9111112.7956113.6800114.5644115.4489116.3333
103.97778104.90877105.83977106.77076107.70175108.63275109.5637110.4947111.4257112.3567113.2877114.2187115.1497116.0807117.0117117.9427118.8737119.8047120.7357121.6667
108.42667109.40421110.38175111.35930112.33684113.31439114.2919115.2695116.2470117.2246118.2021119.1796120.1572121.1347122.1123123.0898124.0674125.0449126.0225127.0000
112.87556113.89965114.92374115.94784116.97193117.99602119.0201120.0442121.0683122.0924123.1165124.1406125.1647126.1888127.2129128.2370129.2611130.2851131.3092132.3333
117.32444118.39509119.46573120.53637121.60702122.67766123.7483124.8189125.8896126.9602128.0309129.1015130.1722131.2428132.3135133.3841134.4547135.5254136.5960137.6667
121.77333122.89053124.00772125.12491126.24211127.35930128.4765129.5937130.7109131.8281132.9453134.0625135.1796136.2968137.4140138.5312139.6484140.7656141.8828143.0000
126.22222127.38596128.54971129.71345130.87719132.04094133.2047134.3684135.5322136.6959137.8596139.0234140.1871141.3509142.5146143.6784144.8421146.0058147.1696148.3333
130.67111131.88140133.09170134.30199135.51228136.72257137.9329139.1432140.3535141.5637142.7740143.9843145.1946146.4049147.6152148.8255150.0358151.2461152.4564153.6667
135.12000136.37684137.63368138.89053140.14737141.40421142.6611143.9179145.1747146.4316147.6884148.9453150.2021151.4589152.7158153.9726155.2295156.4863157.7432159.0000
A matrix: 20 × 10 of type dbl
95.08000 99.52889103.9778108.4267112.8756117.3244121.7733126.2222130.6711135.1200
95.91789100.41333104.9088109.4042113.8996118.3951122.8905127.3860131.8814136.3768
96.75579101.29778105.8398110.3818114.9237119.4657124.0077128.5497133.0917137.6337
97.59368102.18222106.7708111.3593115.9478120.5364125.1249129.7135134.3020138.8905
98.43158103.06667107.7018112.3368116.9719121.6070126.2421130.8772135.5123140.1474
99.26947103.95111108.6327113.3144117.9960122.6777127.3593132.0409136.7226141.4042
100.10737104.83556109.5637114.2919119.0201123.7483128.4765133.2047137.9329142.6611
100.94526105.72000110.4947115.2695120.0442124.8189129.5937134.3684139.1432143.9179
101.78316106.60444111.4257116.2470121.0683125.8896130.7109135.5322140.3535145.1747
102.62105107.48889112.3567117.2246122.0924126.9602131.8281136.6959141.5637146.4316
103.45895108.37333113.2877118.2021123.1165128.0309132.9453137.8596142.7740147.6884
104.29684109.25778114.2187119.1796124.1406129.1015134.0625139.0234143.9843148.9453
105.13474110.14222115.1497120.1572125.1647130.1722135.1796140.1871145.1946150.2021
105.97263111.02667116.0807121.1347126.1888131.2428136.2968141.3509146.4049151.4589
106.81053111.91111117.0117122.1123127.2129132.3135137.4140142.5146147.6152152.7158
107.64842112.79556117.9427123.0898128.2370133.3841138.5312143.6784148.8255153.9726
108.48632113.68000118.8737124.0674129.2611134.4547139.6484144.8421150.0358155.2295
109.32421114.56444119.8047125.0449130.2851135.5254140.7656146.0058151.2461156.4863
110.16211115.44889120.7357126.0225131.3092136.5960141.8828147.1696152.4564157.7432
111.00000116.33333121.6667127.0000132.3333137.6667143.0000148.3333153.6667159.0000
In [18]:
?add_surface
?plot_ly
add_trace                package:plotly                R Documentation

_A_d_d _t_r_a_c_e(_s) _t_o _a _p_l_o_t_l_y _v_i_s_u_a_l_i_z_a_t_i_o_n

_D_e_s_c_r_i_p_t_i_o_n:

     Add trace(s) to a plotly visualization

_U_s_a_g_e:

     add_trace(p, ..., data = NULL, inherit = TRUE)
     
     add_markers(p, x = NULL, y = NULL, z = NULL, ..., data = NULL, inherit = TRUE)
     
     add_text(
       p,
       x = NULL,
       y = NULL,
       z = NULL,
       text = NULL,
       ...,
       data = NULL,
       inherit = TRUE
     )
     
     add_paths(p, x = NULL, y = NULL, z = NULL, ..., data = NULL, inherit = TRUE)
     
     add_lines(p, x = NULL, y = NULL, z = NULL, ..., data = NULL, inherit = TRUE)
     
     add_segments(
       p,
       x = NULL,
       y = NULL,
       xend = NULL,
       yend = NULL,
       ...,
       data = NULL,
       inherit = TRUE
     )
     
     add_polygons(p, x = NULL, y = NULL, ..., data = NULL, inherit = TRUE)
     
     add_sf(p, ..., x = ~x, y = ~y, data = NULL, inherit = TRUE)
     
     add_table(p, ..., rownames = TRUE, data = NULL, inherit = TRUE)
     
     add_ribbons(
       p,
       x = NULL,
       ymin = NULL,
       ymax = NULL,
       ...,
       data = NULL,
       inherit = TRUE
     )
     
     add_image(p, z = NULL, colormodel = NULL, ..., data = NULL, inherit = TRUE)
     
     add_area(p, r = NULL, theta = NULL, t = NULL, ..., data = NULL, inherit = TRUE)
     
     add_pie(p, values = NULL, labels = NULL, ..., data = NULL, inherit = TRUE)
     
     add_bars(p, x = NULL, y = NULL, ..., data = NULL, inherit = TRUE)
     
     add_histogram(p, x = NULL, y = NULL, ..., data = NULL, inherit = TRUE)
     
     add_histogram2d(
       p,
       x = NULL,
       y = NULL,
       z = NULL,
       ...,
       data = NULL,
       inherit = TRUE
     )
     
     add_histogram2dcontour(
       p,
       x = NULL,
       y = NULL,
       z = NULL,
       ...,
       data = NULL,
       inherit = TRUE
     )
     
     add_heatmap(p, x = NULL, y = NULL, z = NULL, ..., data = NULL, inherit = TRUE)
     
     add_contour(p, z = NULL, ..., data = NULL, inherit = TRUE)
     
     add_boxplot(p, x = NULL, y = NULL, ..., data = NULL, inherit = TRUE)
     
     add_surface(p, z = NULL, ..., data = NULL, inherit = TRUE)
     
     add_mesh(p, x = NULL, y = NULL, z = NULL, ..., data = NULL, inherit = TRUE)
     
     add_scattergeo(p, ...)
     
     add_choropleth(p, z = NULL, ..., data = NULL, inherit = TRUE)
     
_A_r_g_u_m_e_n_t_s:

       p: a plotly object

     ...: Arguments (i.e., attributes) passed along to the trace
          ‘type’. See ‘schema()’ for a list of acceptable attributes
          for a given trace ‘type’ (by going to ‘traces’ -> ‘type’ ->
          ‘attributes’). Note that attributes provided at this level
          may override other arguments (e.g. ‘plot_ly(x = 1:10, y =
          1:10, color = I("red"), marker = list(color = "blue"))’).

    data: A data frame (optional) or crosstalk::SharedData object.

 inherit: inherit attributes from ‘plot_ly()’?

       x: the x variable.

       y: the y variable.

       z: a numeric matrix (unless ‘add_image()’, which wants a raster
          object, see ‘as.raster()’).

    text: textual labels.

    xend: "final" x position (in this context, x represents "start")

    yend: "final" y position (in this context, y represents "start")

rownames: whether or not to display the rownames of ‘data’.

    ymin: a variable used to define the lower boundary of a polygon.

    ymax: a variable used to define the upper boundary of a polygon.

colormodel: Sets the colormodel for image traces if ‘z’ is not a raster
          object. If ‘z’ is a raster object (see ‘as.raster()’), the
          ‘'rgba'’ colormodel is always used.

       r: Sets the radial coordinates.

   theta: Sets the angular coordinates.

       t: Deprecated. Use ‘theta’ instead.

  values: the value to associated with each slice of the pie.

  labels: the labels (categories) corresponding to ‘values’.

_A_u_t_h_o_r(_s):

     Carson Sievert

_R_e_f_e_r_e_n_c_e_s:

     <https://plotly-r.com/overview.html>

     <https://plotly.com/r/>

     <https://plotly.com/r/reference/>

_S_e_e _A_l_s_o:

     ‘plot_ly()’

_E_x_a_m_p_l_e_s:

     # the `plot_ly()` function initiates an object, and if no trace type
     # is specified, it sets a sensible default
     p <- plot_ly(economics, x = ~date, y = ~uempmed)
     p
     
     # some `add_*()` functions are a specific case of a trace type
     # for example, `add_markers()` is a scatter trace with mode of markers
     add_markers(p)
     
     # scatter trace with mode of text
     add_text(p, text = "%")
     
     # scatter trace with mode of lines 
     add_paths(p)
     
     # like `add_paths()`, but ensures points are connected according to `x`
     add_lines(p)
     
     # if you prefer to work with plotly.js more directly, can always
     # use `add_trace()` and specify the type yourself
     add_trace(p, type = "scatter", mode = "markers+lines")
     
     # mappings provided to `plot_ly()` are "global", but can be overwritten
     plot_ly(economics, x = ~date, y = ~uempmed, color = I("red"), showlegend = FALSE) %>% 
       add_lines() %>%
       add_markers(color = ~pop)
     
     # a number of `add_*()` functions are special cases of the scatter trace
     plot_ly(economics, x = ~date) %>% 
       add_ribbons(ymin = ~pce - 1e3, ymax = ~pce + 1e3)
     
     # use `group_by()` (or `group2NA()`) to apply visual mapping
     # once per group (e.g. one line per group)
     txhousing %>% 
       group_by(city) %>% 
       plot_ly(x = ~date, y = ~median) %>%
       add_lines(color = I("black"))
     
     ## Not run:
     
     # use `add_sf()` or `add_polygons()` to create geo-spatial maps
     # http://blog.cpsievert.me/2018/03/30/visualizing-geo-spatial-data-with-sf-and-plotly/
     if (requireNamespace("sf", quietly = TRUE)) {
       nc <- sf::st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
       plot_ly() %>% add_sf(data = nc)
     }
     
     # univariate summary statistics
     plot_ly(mtcars, x = ~factor(vs), y = ~mpg) %>% 
       add_boxplot()
     plot_ly(mtcars, x = ~factor(vs), y = ~mpg) %>% 
       add_trace(type = "violin")
       
     # `add_histogram()` does binning for you...
     mtcars %>%
       plot_ly(x = ~factor(vs)) %>%
       add_histogram()
       
     # ...but you can 'pre-compute' bar heights in R
     mtcars %>%
       dplyr::count(vs) %>%
       plot_ly(x = ~vs, y = ~n) %>%
       add_bars()
     
     # the 2d analogy of add_histogram() is add_histogram2d()/add_histogram2dcontour()
     library(MASS)
     (p <- plot_ly(geyser, x = ~waiting, y = ~duration))
     add_histogram2d(p)
     add_histogram2dcontour(p)
     
     # the 2d analogy of add_bars() is add_heatmap()/add_contour()
     # (i.e., bin counts must be pre-specified)
     den <- kde2d(geyser$waiting, geyser$duration)
     p <- plot_ly(x = den$x, y = den$y, z = den$z)
     add_heatmap(p)
     add_contour(p)
     
     # `add_table()` makes it easy to map a data frame to the table trace type
     plot_ly(economics) %>% 
       add_table()
     
     # pie charts!
     ds <- data.frame(labels = c("A", "B", "C"), values = c(10, 40, 60))
     plot_ly(ds, labels = ~labels, values = ~values) %>%
       add_pie() %>%
       layout(title = "Basic Pie Chart using Plotly")
       
     data(wind)
     plot_ly(wind, r = ~r, theta = ~t) %>% 
       add_area(color = ~nms) %>%
       layout(
         polar = list(
           radialaxis = list(ticksuffix = "%"), 
           angularaxis = list(rotation = 90)
         )
       )
     
     # ------------------------------------------------------------
     # 3D chart types
     # ------------------------------------------------------------
     plot_ly(z = ~volcano) %>% 
       add_surface()
     plot_ly(x = c(0, 0, 1), y = c(0, 1, 0), z = c(0, 0, 0)) %>% 
       add_mesh()
     ## End(Not run)
     
plot_ly                 package:plotly                 R Documentation

_I_n_i_t_i_a_t_e _a _p_l_o_t_l_y _v_i_s_u_a_l_i_z_a_t_i_o_n

_D_e_s_c_r_i_p_t_i_o_n:

     This function maps R objects to plotly.js, an (MIT licensed)
     web-based interactive charting library. It provides abstractions
     for doing common things (e.g. mapping data values to fill colors
     (via ‘color’) or creating animations (via ‘frame’)) and sets some
     different defaults to make the interface feel more 'R-like' (i.e.,
     closer to ‘plot()’ and ‘ggplot2::qplot()’).

_U_s_a_g_e:

     plot_ly(
       data = data.frame(),
       ...,
       type = NULL,
       name,
       color,
       colors = NULL,
       alpha = NULL,
       stroke,
       strokes = NULL,
       alpha_stroke = 1,
       size,
       sizes = c(10, 100),
       span,
       spans = c(1, 20),
       symbol,
       symbols = NULL,
       linetype,
       linetypes = NULL,
       split,
       frame,
       width = NULL,
       height = NULL,
       source = "A"
     )
     
_A_r_g_u_m_e_n_t_s:

    data: A data frame (optional) or crosstalk::SharedData object.

     ...: Arguments (i.e., attributes) passed along to the trace
          ‘type’. See ‘schema()’ for a list of acceptable attributes
          for a given trace ‘type’ (by going to ‘traces’ -> ‘type’ ->
          ‘attributes’). Note that attributes provided at this level
          may override other arguments (e.g. ‘plot_ly(x = 1:10, y =
          1:10, color = I("red"), marker = list(color = "blue"))’).

    type: A character string specifying the trace type (e.g.
          ‘"scatter"’, ‘"bar"’, ‘"box"’, etc). If specified, it
          _always_ creates a trace, otherwise

    name: Values mapped to the trace's name attribute. Since a trace
          can only have one name, this argument acts very much like
          ‘split’ in that it creates one trace for every unique value.

   color: Values mapped to relevant 'fill-color' attribute(s) (e.g.
          fillcolor, marker.color, textfont.color, etc.). The mapping
          from data values to color codes may be controlled using
          ‘colors’ and ‘alpha’, or avoided altogether via ‘I()’ (e.g.,
          ‘color = I("red")’). Any color understood by
          ‘grDevices::col2rgb()’ may be used in this way.

  colors: Either a colorbrewer2.org palette name (e.g. "YlOrRd" or
          "Blues"), or a vector of colors to interpolate in hexadecimal
          "#RRGGBB" format, or a color interpolation function like
          ‘colorRamp()’.

   alpha: A number between 0 and 1 specifying the alpha channel applied
          to ‘color’. Defaults to 0.5 when mapping to fillcolor and 1
          otherwise.

  stroke: Similar to ‘color’, but values are mapped to relevant
          'stroke-color' attribute(s) (e.g., marker.line.color and
          line.color for filled polygons). If not specified, ‘stroke’
          inherits from ‘color’.

 strokes: Similar to ‘colors’, but controls the ‘stroke’ mapping.

alpha_stroke: Similar to ‘alpha’, but applied to ‘stroke’.

    size: (Numeric) values mapped to relevant 'fill-size' attribute(s)
          (e.g., marker.size, textfont.size, and error_x.width). The
          mapping from data values to symbols may be controlled using
          ‘sizes’, or avoided altogether via ‘I()’ (e.g., ‘size =
          I(30)’).

   sizes: A numeric vector of length 2 used to scale ‘size’ to pixels.

    span: (Numeric) values mapped to relevant 'stroke-size'
          attribute(s) (e.g., marker.line.width, line.width for filled
          polygons, and error_x.thickness) The mapping from data values
          to symbols may be controlled using ‘spans’, or avoided
          altogether via ‘I()’ (e.g., ‘span = I(30)’).

   spans: A numeric vector of length 2 used to scale ‘span’ to pixels.

  symbol: (Discrete) values mapped to marker.symbol. The mapping from
          data values to symbols may be controlled using ‘symbols’, or
          avoided altogether via ‘I()’ (e.g., ‘symbol =
          I("pentagon")’). Any pch value or symbol name may be used in
          this way.

 symbols: A character vector of pch values or symbol names.

linetype: (Discrete) values mapped to line.dash. The mapping from data
          values to symbols may be controlled using ‘linetypes’, or
          avoided altogether via ‘I()’ (e.g., ‘linetype = I("dash")’).
          Any ‘lty’ (see par) value or dash name may be used in this
          way.

linetypes: A character vector of ‘lty’ values or dash names

   split: (Discrete) values used to create multiple traces (one trace
          per value).

   frame: (Discrete) values used to create animation frames.

   width: Width in pixels (optional, defaults to automatic sizing).

  height: Height in pixels (optional, defaults to automatic sizing).

  source: a character string of length 1. Match the value of this
          string with the source argument in ‘event_data()’ to retrieve
          the event data corresponding to a specific plot (shiny apps
          can have multiple plots).

_D_e_t_a_i_l_s:

     Unless ‘type’ is specified, this function just initiates a plotly
     object with 'global' attributes that are passed onto downstream
     uses of ‘add_trace()’ (or similar). A formula must always be used
     when referencing column name(s) in ‘data’ (e.g. ‘plot_ly(mtcars, x
     = ~wt)’). Formulas are optional when supplying values directly,
     but they do help inform default axis/scale titles (e.g.,
     ‘plot_ly(x = mtcars$wt)’ vs ‘plot_ly(x = ~mtcars$wt)’)

_A_u_t_h_o_r(_s):

     Carson Sievert

_R_e_f_e_r_e_n_c_e_s:

     <https://plotly-r.com/overview.html>

_S_e_e _A_l_s_o:

        • For initializing a plotly-geo object: ‘plot_geo()’

        • For initializing a plotly-mapbox object: ‘plot_mapbox()’

        • For translating a ggplot2 object to a plotly object:
          ‘ggplotly()’

        • For modifying any plotly object: ‘layout()’, ‘add_trace()’,
          ‘style()’

        • For linked brushing: ‘highlight()’

        • For arranging multiple plots: ‘subplot()’,
          ‘crosstalk::bscols()’

        • For inspecting plotly objects: ‘plotly_json()’

        • For quick, accurate, and searchable plotly.js reference:
          ‘schema()’

_E_x_a_m_p_l_e_s:

     ## Not run:
     
     
     # plot_ly() tries to create a sensible plot based on the information you 
     # give it. If you don't provide a trace type, plot_ly() will infer one.
     plot_ly(economics, x = ~pop)
     plot_ly(economics, x = ~date, y = ~pop)
     # plot_ly() doesn't require data frame(s), which allows one to take 
     # advantage of trace type(s) designed specifically for numeric matrices
     plot_ly(z = ~volcano)
     plot_ly(z = ~volcano, type = "surface")
     
     # plotly has a functional interface: every plotly function takes a plotly
     # object as it's first input argument and returns a modified plotly object
     add_lines(plot_ly(economics, x = ~date, y = ~unemploy/pop))
     
     # To make code more readable, plotly imports the pipe operator from magrittr
     economics %>% plot_ly(x = ~date, y = ~unemploy/pop) %>% add_lines()
     
     # Attributes defined via plot_ly() set 'global' attributes that 
     # are carried onto subsequent traces, but those may be over-written
     plot_ly(economics, x = ~date, color = I("black")) %>%
      add_lines(y = ~uempmed) %>%
      add_lines(y = ~psavert, color = I("red"))
     
     # Attributes are documented in the figure reference -> https://plotly.com/r/reference
     # You might notice plot_ly() has named arguments that aren't in this figure
     # reference. These arguments make it easier to map abstract data values to
     # visual attributes.
     p <- plot_ly(palmerpenguins::penguins, x = ~bill_length_mm, y = ~body_mass_g)
     add_markers(p, color = ~bill_depth_mm, size = ~bill_depth_mm)
     add_markers(p, color = ~species)
     add_markers(p, color = ~species, colors = "Set1")
     add_markers(p, symbol = ~species)
     add_paths(p, linetype = ~species)
     ## End(Not run)
     
In [21]:
college_test <- (outer(gpa_test,iq_test,LEVEL=1,FUN=salary_model))
college_test
A matrix: 10 × 10 of type dbl
95.08000 96.84889 98.61778100.3867102.1556103.9244105.6933107.4622109.2311111.0000
99.52889101.39605103.26321105.1304106.9975108.8647110.7319112.5990114.4662116.3333
103.97778105.94321107.90864109.8741111.8395113.8049115.7704117.7358119.7012121.6667
108.42667110.49037112.55407114.6178116.6815118.7452120.8089122.8726124.9363127.0000
112.87556115.03753117.19951119.3615121.5235123.6854125.8474128.0094130.1714132.3333
117.32444119.58469121.84494124.1052126.3654128.6257130.8859133.1462135.4064137.6667
121.77333124.13185126.49037128.8489131.2074133.5659135.9244138.2830140.6415143.0000
126.22222128.67901131.13580133.5926136.0494138.5062140.9630143.4198145.8765148.3333
130.67111133.22617135.78123138.3363140.8914143.4464146.0015148.5565151.1116153.6667
135.12000137.77333140.42667143.0800145.7333148.3867151.0400153.6933156.3467159.0000
In [23]:
# a 
# 固定IQ和GPA的取值,然后考虑预测变量的取值对于响应变量的影响 
 
library(plotly)   #用于构建交互式的图表 
salary_model <- function(GPA,IQ,LEVEL){
  50+20*GPA+0.07*IQ+35*LEVEL+0.01*GPA*IQ-10*GPA*LEVEL
}
gpa_test <- seq(1,5,length=10)  # 人工构建从1到5,长度为10的等间隔的IQ值
iq_test <- seq(1,200,length=10) # 人工构建从1到200,长度为10的等间隔的GPA值

college_test <- t(outer(gpa_test,iq_test,LEVEL=1,FUN=salary_model))  #从构建的gpa以及iq值中,预测出的salary值,LEVEL=1表示大学学历,对获取的各种gpaxiq值的组会
high_school_test <- t(outer(gpa_test,iq_test,LEVEL=0,FUN=salary_model)) #同上,注意此处需要进行转置,否则获得的结果在x轴以及y轴上是反过来的!!!

plot_ly(x=gpa_test,y=iq_test) |>
  add_surface(   # 添加一个 3D 表面图
    x=gpa_test,y=iq_test,
    z=~college_test, # 指定 z 轴的数据为 College 学生的薪水
    colorscale=list(c(0, 1), c("rgb(173,216,230)", "rgb(0,0,139)")),  # 指定颜色渐变,浅蓝色到深蓝色
    colorbar=list(title="college") # 指定颜色条的标题
) |> 
add_surface(    # 添加另一个 3D 表面图
  x=gpa_test,y=iq_test,
  z=~high_school_test,
  colorscale=list(c(0, 1), c("rgb(255,182,193)", "rgb(139,0,0)")),  # 浅红色到深红色
  colorbar=list(title="high school")
) |>
layout(    # 设置图表布局      
  scene=list(  # 指定 3D 场景的轴标签
    xaxis = list(title = "GPA"),
    yaxis = list(title = "IQ"),
    zaxis = list(title = "Salary")  
  )  
)                               

image-2.png

结论是和前面数学表达式中获取的一致的

image.png

我们以第1位作为测试,也就是gpa是1,然后iq是23.11111的情况,对于college也就是大学生来说薪资是96.84,然后我们查看上面的矩阵的值:

image-2.png

发现实际上是有问题的,所以得在outer之后的数据中进行转置,或者是在add_surface中输入x以及y的值的时候,反过来定义

image-3.png

我们此处进行转置,然后结果确实就验证正确了

image-4.png

In [ ]:
# b
salary_model(GPA = 4.0,IQ = 110,LEVEL = 1)  # 预测大学学历的薪水,是137.1
137.1
In [ ]:
# c
# 对于交叉项的效应,我们同样可以通过lm函数来进行计算
# 就像前面实践lab中的那样,summary(lm(medv ~ lstat * age, data = Boston)) ,我们需要检验这个交叉项的显著值,我们才能判定这个交叉项是否对于模型的预测有重要的影响
# 所以这个问题是false!

4,

image.png

a

首先依据课件ppt上的说法:

image.png

In [ ]:
# 越灵活。参数越多的模型,其在训练集上的表现会越好,但是在测试集上的表现可能会越差,这就是过拟合的问题
# 所以立方项的模型拟合会比线性模型拟合具有更低的训练集的RSS,立方项的拟合只需要在高于一次项的项上的系数归0,就可以得到一个线性模型,总之就是更加灵活
# 所以立方项的模型拟合会比线性模型拟合具有更低的训练集的RSS
In [ ]:
# b
# 测试集部分的RSS,实际上会呈现出U形曲线,所以无法确定

# c
# 还是立方项的拟合模型会导致更低的训练集的RSS,道理和前面一样,因为训练集上拟合的话,立方项的模型永远比线性模型拟合得更好

# d
# 无法回答,因为到底Y和X之间的非线性或者线性关系如何无法确定

9,

image.png

In [29]:
# a
library(ISLR2)
pairs(Auto) 
No description has been provided for this image
In [ ]:
# b

names(Auto)
# 'mpg''cylinders''displacement''horsepower''weight''acceleration''year''origin''name'
Auto %>% select(-name)%>%cor() # 去掉最后一列非数值变量列
  1. 'mpg'
  2. 'cylinders'
  3. 'displacement'
  4. 'horsepower'
  5. 'weight'
  6. 'acceleration'
  7. 'year'
  8. 'origin'
  9. 'name'
A matrix: 8 × 8 of type dbl
mpgcylindersdisplacementhorsepowerweightaccelerationyearorigin
mpg 1.0000000-0.7776175-0.8051269-0.7784268-0.8322442 0.4233285 0.5805410 0.5652088
cylinders-0.7776175 1.0000000 0.9508233 0.8429834 0.8975273-0.5046834-0.3456474-0.5689316
displacement-0.8051269 0.9508233 1.0000000 0.8972570 0.9329944-0.5438005-0.3698552-0.6145351
horsepower-0.7784268 0.8429834 0.8972570 1.0000000 0.8645377-0.6891955-0.4163615-0.4551715
weight-0.8322442 0.8975273 0.9329944 0.8645377 1.0000000-0.4168392-0.3091199-0.5850054
acceleration 0.4233285-0.5046834-0.5438005-0.6891955-0.4168392 1.0000000 0.2903161 0.2127458
year 0.5805410-0.3456474-0.3698552-0.4163615-0.3091199 0.2903161 1.0000000 0.1815277
origin 0.5652088-0.5689316-0.6145351-0.4551715-0.5850054 0.2127458 0.1815277 1.0000000
In [ ]:
# c
lm.fit <- lm(mpg ~ . -name,data = Auto)
summary(lm.fit)
Call:
lm(formula = mpg ~ . - name, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.5903 -2.1565 -0.1169  1.8690 13.0604 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
cylinders     -0.493376   0.323282  -1.526  0.12780    
displacement   0.019896   0.007515   2.647  0.00844 ** 
horsepower    -0.016951   0.013787  -1.230  0.21963    
weight        -0.006474   0.000652  -9.929  < 2e-16 ***
acceleration   0.080576   0.098845   0.815  0.41548    
year           0.750773   0.050973  14.729  < 2e-16 ***
origin         1.426141   0.278136   5.127 4.67e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.328 on 384 degrees of freedom
Multiple R-squared:  0.8215,	Adjusted R-squared:  0.8182 
F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16
In [ ]:
# 从上面我们可以很清楚地看到响应变量和预测变量之间的关系,从p值以及系数的估计值的正负综合来判断

# 有统计学显著性的是displacement、weight、year 、origin

# year的系数是正的,表明mpg平均每年增加0.750773(单位)
In [ ]:
# d
par(mfrow = c(2, 2))
plot(lm.fit) 
No description has been provided for this image

残差与拟合值图

残差在零线周围随机分布,这是一个好兆头。没有明显的模式或曲线表明关系中存在非线性。然而,有几个残差相对较大的点(正负都有),可能值得进一步调查。

残差的 Q-Q 图

Q-Q 图显示残差大致沿着对角线分布,表明它们大致呈正态分布。这是线性回归的许多统计检验的重要假设。在尾部有几个点稍微偏离直线,但这很常见,并不一定表明正态性严重违反。

尺度-位置图

尺度-位置图显示绝对标准化残差的平方根与拟合值的关系。红线大致平坦,表明残差的方差在拟合值范围内相对稳定。这被称为同方差性,是线性回归的另一个关键假设。在较高拟合值处方差有轻微增加,但并不明显。

残差与杠杆值图

此图有助于识别可能对回归模型产生不成比例影响的高杠杆点。大多数点落在典型范围内,但有几个点的杠杆值较高(大约在 0.15-0.20 之间)。这些点应进行检查,以确定它们是否对模型拟合产生过大影响。库克距离等高线(虚线)有助于识别对模型有较大影响的点;等高线内的点影响较小。

问题总结

拟合问题:总体来看,模型拟合良好,残差随机分布且大致呈正态分布。在较高拟合值处方差有轻微增加,但并不严重。

异常大的离群值:有几个残差 magnitude 较大的点,但没有特别异常的离群值。

高杠杆观测值:有几个观测值的杠杆值较高,但没有特别高。有必要检查这些点,以确保它们没有驱动模型结果。

In [39]:
# e

# 有7个预测变量,理论上来说,有C72 = 7!/(2! * 5!) = 21个两两相关的交叉项
# 加载必要的库
library(dplyr)

# 定义变量
variables <- c('cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year', 'origin')

# 生成所有两两变量组合的交叉项
combinations <- combn(variables, 2, simplify = FALSE)

# 循环遍历每个组合,拟合线性模型并输出摘要
for (combo in combinations) {
  interaction_term <- paste(combo, collapse = ":")
  formula <- as.formula(paste("mpg ~ . -name +", interaction_term))
  lm.fit <- lm(formula, data = Auto)
  cat("\n\n### Interaction Term:", interaction_term, "\n")
  print(summary(lm.fit))
}
Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



### Interaction Term: cylinders:displacement 

Call:
lm(formula = formula, data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.6081  -1.7833  -0.0465   1.6821  12.2617 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)            -2.7096590  4.6858582  -0.578 0.563426    
cylinders              -2.6962123  0.4094916  -6.584 1.51e-10 ***
displacement           -0.0774797  0.0141535  -5.474 7.96e-08 ***
horsepower             -0.0476026  0.0133736  -3.559 0.000418 ***
weight                 -0.0052339  0.0006253  -8.370 1.10e-15 ***
acceleration            0.0597997  0.0918038   0.651 0.515188    
year                    0.7594500  0.0473354  16.044  < 2e-16 ***
origin                  0.7087399  0.2736917   2.590 0.009976 ** 
cylinders:displacement  0.0136081  0.0017209   7.907 2.84e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.089 on 383 degrees of freedom
Multiple R-squared:  0.8465,	Adjusted R-squared:  0.8433 
F-statistic: 264.1 on 8 and 383 DF,  p-value: < 2.2e-16



### Interaction Term: cylinders:horsepower 

Call:
lm(formula = formula, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.2399 -1.6871 -0.0511  1.2858 11.9380 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)          11.7025260  4.9115648   2.383 0.017676 *  
cylinders            -4.3060695  0.4580950  -9.400  < 2e-16 ***
displacement         -0.0013925  0.0069110  -0.201 0.840426    
horsepower           -0.3156601  0.0306339 -10.304  < 2e-16 ***
weight               -0.0038948  0.0006231  -6.250 1.09e-09 ***
acceleration         -0.1703028  0.0901427  -1.889 0.059612 .  
year                  0.7393193  0.0448736  16.476  < 2e-16 ***
origin                0.9031644  0.2496880   3.617 0.000338 ***
cylinders:horsepower  0.0402008  0.0037856  10.619  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.929 on 383 degrees of freedom
Multiple R-squared:  0.8621,	Adjusted R-squared:  0.8592 
F-statistic: 299.3 on 8 and 383 DF,  p-value: < 2.2e-16



### Interaction Term: cylinders:weight 

Call:
lm(formula = formula, data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.9484  -1.7133  -0.1809   1.4530  12.4137 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       7.3143478  5.0076737   1.461  0.14494    
cylinders        -5.0347425  0.5795767  -8.687  < 2e-16 ***
displacement      0.0156444  0.0068409   2.287  0.02275 *  
horsepower       -0.0314213  0.0126216  -2.489  0.01322 *  
weight           -0.0150329  0.0011125 -13.513  < 2e-16 ***
acceleration      0.1006438  0.0897944   1.121  0.26306    
year              0.7813453  0.0464139  16.834  < 2e-16 ***
origin            0.8030154  0.2617333   3.068  0.00231 ** 
cylinders:weight  0.0015058  0.0001657   9.088  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.022 on 383 degrees of freedom
Multiple R-squared:  0.8531,	Adjusted R-squared:  0.8501 
F-statistic: 278.1 on 8 and 383 DF,  p-value: < 2.2e-16



### Interaction Term: cylinders:acceleration 

Call:
lm(formula = formula, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.6499 -2.0696 -0.1522  1.7965 12.5285 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)            -3.424e+01  5.741e+00  -5.964 5.60e-09 ***
cylinders               2.940e+00  7.810e-01   3.764 0.000193 ***
displacement            8.611e-03  7.677e-03   1.122 0.262667    
horsepower             -3.805e-02  1.411e-02  -2.697 0.007308 ** 
weight                 -5.238e-03  6.844e-04  -7.653 1.61e-13 ***
acceleration            1.097e+00  2.324e-01   4.719 3.33e-06 ***
year                    7.654e-01  4.966e-02  15.412  < 2e-16 ***
origin                  1.269e+00  2.725e-01   4.657 4.44e-06 ***
cylinders:acceleration -2.141e-01  4.458e-02  -4.802 2.25e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.236 on 383 degrees of freedom
Multiple R-squared:  0.8316,	Adjusted R-squared:  0.8281 
F-statistic: 236.4 on 8 and 383 DF,  p-value: < 2.2e-16



### Interaction Term: cylinders:year 

Call:
lm(formula = formula, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.3346 -2.0533 -0.1177  1.6257 12.0879 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -9.290e+01  1.256e+01  -7.399 8.75e-13 ***
cylinders       1.469e+01  2.378e+00   6.178 1.66e-09 ***
displacement    1.175e-02  7.259e-03   1.619   0.1063    
horsepower     -2.765e-02  1.322e-02  -2.092   0.0371 *  
weight         -6.242e-03  6.212e-04 -10.049  < 2e-16 ***
acceleration    1.213e-01  9.423e-02   1.287   0.1988    
year            1.734e+00  1.601e-01  10.826  < 2e-16 ***
origin          1.375e+00  2.647e-01   5.197 3.30e-07 ***
cylinders:year -1.949e-01  3.027e-02  -6.439 3.60e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.165 on 383 degrees of freedom
Multiple R-squared:  0.8389,	Adjusted R-squared:  0.8356 
F-statistic: 249.3 on 8 and 383 DF,  p-value: < 2.2e-16



### Interaction Term: cylinders:origin 

Call:
lm(formula = formula, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.7908 -2.1679 -0.1234  1.9043 13.0306 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -1.792e+01  5.169e+00  -3.466 0.000587 ***
cylinders        -3.572e-01  5.459e-01  -0.654 0.513327    
displacement      1.919e-02  7.865e-03   2.439 0.015169 *  
horsepower       -1.656e-02  1.386e-02  -1.195 0.232926    
weight           -6.462e-03  6.541e-04  -9.879  < 2e-16 ***
acceleration      8.068e-02  9.896e-02   0.815 0.415449    
year              7.527e-01  5.141e-02  14.640  < 2e-16 ***
origin            1.838e+00  1.358e+00   1.353 0.176836    
cylinders:origin -9.985e-02  3.223e-01  -0.310 0.756898    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.332 on 383 degrees of freedom
Multiple R-squared:  0.8215,	Adjusted R-squared:  0.8178 
F-statistic: 220.4 on 8 and 383 DF,  p-value: < 2.2e-16



### Interaction Term: displacement:horsepower 

Call:
lm(formula = formula, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.7010 -1.6009 -0.0967  1.4119 12.6734 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)             -1.894e+00  4.302e+00  -0.440  0.66007    
cylinders                6.466e-01  3.017e-01   2.143  0.03275 *  
displacement            -7.487e-02  1.092e-02  -6.859 2.80e-11 ***
horsepower              -1.975e-01  2.052e-02  -9.624  < 2e-16 ***
weight                  -3.147e-03  6.475e-04  -4.861 1.71e-06 ***
acceleration            -2.131e-01  9.062e-02  -2.351  0.01921 *  
year                     7.379e-01  4.463e-02  16.534  < 2e-16 ***
origin                   6.891e-01  2.527e-01   2.727  0.00668 ** 
displacement:horsepower  5.236e-04  4.813e-05  10.878  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.912 on 383 degrees of freedom
Multiple R-squared:  0.8636,	Adjusted R-squared:  0.8608 
F-statistic: 303.1 on 8 and 383 DF,  p-value: < 2.2e-16



### Interaction Term: displacement:weight 

Call:
lm(formula = formula, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.9027 -1.8092 -0.0946  1.5549 12.1687 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)         -5.389e+00  4.301e+00  -1.253   0.2109    
cylinders            1.175e-01  2.943e-01   0.399   0.6899    
displacement        -6.837e-02  1.104e-02  -6.193 1.52e-09 ***
horsepower          -3.280e-02  1.238e-02  -2.649   0.0084 ** 
weight              -1.064e-02  7.136e-04 -14.915  < 2e-16 ***
acceleration         6.724e-02  8.805e-02   0.764   0.4455    
year                 7.852e-01  4.553e-02  17.246  < 2e-16 ***
origin               5.610e-01  2.622e-01   2.139   0.0331 *  
displacement:weight  2.269e-05  2.257e-06  10.054  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.964 on 383 degrees of freedom
Multiple R-squared:  0.8588,	Adjusted R-squared:  0.8558 
F-statistic: 291.1 on 8 and 383 DF,  p-value: < 2.2e-16



### Interaction Term: displacement:acceleration 

Call:
lm(formula = formula, data = Auto)

Residuals:
   Min     1Q Median     3Q    Max 
-8.129 -1.899 -0.135  1.755 12.119 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)               -3.005e+01  4.737e+00  -6.343 6.36e-10 ***
cylinders                  2.136e-03  3.125e-01   0.007 0.994550    
displacement               7.022e-02  1.005e-02   6.989 1.24e-11 ***
horsepower                -5.515e-02  1.407e-02  -3.920 0.000105 ***
weight                    -4.211e-03  6.929e-04  -6.077 2.96e-09 ***
acceleration               7.530e-01  1.332e-01   5.653 3.09e-08 ***
year                       7.722e-01  4.811e-02  16.051  < 2e-16 ***
origin                     1.057e+00  2.671e-01   3.958 9.01e-05 ***
displacement:acceleration -4.855e-03  6.879e-04  -7.058 7.99e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.134 on 383 degrees of freedom
Multiple R-squared:  0.842,	Adjusted R-squared:  0.8387 
F-statistic: 255.2 on 8 and 383 DF,  p-value: < 2.2e-16



### Interaction Term: displacement:year 

Call:
lm(formula = formula, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.9549 -1.9978 -0.0345  1.6232 12.2793 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -6.241e+01  7.811e+00  -7.990 1.60e-14 ***
cylinders          1.200e-02  3.133e-01   0.038   0.9695    
displacement       2.710e-01  3.663e-02   7.399 8.78e-13 ***
horsepower        -3.201e-02  1.318e-02  -2.429   0.0156 *  
weight            -5.860e-03  6.211e-04  -9.435  < 2e-16 ***
acceleration       1.059e-01  9.328e-02   1.136   0.2568    
year               1.334e+00  9.636e-02  13.848  < 2e-16 ***
origin             1.230e+00  2.638e-01   4.662 4.34e-06 ***
displacement:year -3.491e-03  4.997e-04  -6.988 1.25e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.138 on 383 degrees of freedom
Multiple R-squared:  0.8417,	Adjusted R-squared:  0.8384 
F-statistic: 254.5 on 8 and 383 DF,  p-value: < 2.2e-16



### Interaction Term: displacement:origin 

Call:
lm(formula = formula, data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.2312  -2.0927  -0.1295   1.7287  12.2761 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)         -2.140e+01  4.827e+00  -4.434 1.21e-05 ***
cylinders           -4.731e-01  3.204e-01  -1.477 0.140577    
displacement         3.802e-02  9.768e-03   3.892 0.000117 ***
horsepower          -1.523e-02  1.367e-02  -1.114 0.265920    
weight              -5.940e-03  6.723e-04  -8.835  < 2e-16 ***
acceleration         4.578e-02  9.868e-02   0.464 0.642939    
year                 7.682e-01  5.087e-02  15.103  < 2e-16 ***
origin               3.810e+00  8.762e-01   4.349 1.76e-05 ***
displacement:origin -2.262e-02  7.891e-03  -2.866 0.004383 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.297 on 383 degrees of freedom
Multiple R-squared:  0.8252,	Adjusted R-squared:  0.8216 
F-statistic: 226.1 on 8 and 383 DF,  p-value: < 2.2e-16



### Interaction Term: horsepower:weight 

Call:
lm(formula = formula, data = Auto)

Residuals:
   Min     1Q Median     3Q    Max 
-8.589 -1.617 -0.184  1.541 12.001 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        2.876e+00  4.511e+00   0.638 0.524147    
cylinders         -2.955e-02  2.881e-01  -0.103 0.918363    
displacement       5.950e-03  6.750e-03   0.881 0.378610    
horsepower        -2.313e-01  2.363e-02  -9.791  < 2e-16 ***
weight            -1.121e-02  7.285e-04 -15.393  < 2e-16 ***
acceleration      -9.019e-02  8.855e-02  -1.019 0.309081    
year               7.695e-01  4.494e-02  17.124  < 2e-16 ***
origin             8.344e-01  2.513e-01   3.320 0.000986 ***
horsepower:weight  5.529e-05  5.227e-06  10.577  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.931 on 383 degrees of freedom
Multiple R-squared:  0.8618,	Adjusted R-squared:  0.859 
F-statistic: 298.6 on 8 and 383 DF,  p-value: < 2.2e-16



### Interaction Term: horsepower:acceleration 

Call:
lm(formula = formula, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.0329 -1.8177 -0.1183  1.7247 12.4870 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)             -32.499820   4.923380  -6.601 1.36e-10 ***
cylinders                 0.083489   0.316913   0.263 0.792350    
displacement             -0.007649   0.008161  -0.937 0.349244    
horsepower                0.127188   0.024746   5.140 4.40e-07 ***
weight                   -0.003976   0.000716  -5.552 5.27e-08 ***
acceleration              0.983282   0.161513   6.088 2.78e-09 ***
year                      0.755919   0.048179  15.690  < 2e-16 ***
origin                    1.035733   0.268962   3.851 0.000138 ***
horsepower:acceleration  -0.012139   0.001772  -6.851 2.93e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.145 on 383 degrees of freedom
Multiple R-squared:  0.841,	Adjusted R-squared:  0.8376 
F-statistic: 253.2 on 8 and 383 DF,  p-value: < 2.2e-16



### Interaction Term: horsepower:year 

Call:
lm(formula = formula, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.6314 -1.8896 -0.0724  1.5781 11.6140 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -9.723e+01  1.008e+01  -9.648  < 2e-16 ***
cylinders        8.146e-02  3.027e-01   0.269    0.788    
displacement     5.183e-03  7.072e-03   0.733    0.464    
horsepower       8.070e-01  9.496e-02   8.498 4.33e-16 ***
weight          -5.288e-03  6.112e-04  -8.651  < 2e-16 ***
acceleration    -2.894e-02  9.121e-02  -0.317    0.751    
year             1.830e+00  1.318e-01  13.887  < 2e-16 ***
origin           1.178e+00  2.558e-01   4.604 5.66e-06 ***
horsepower:year -1.140e-02  1.302e-03  -8.754  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.042 on 383 degrees of freedom
Multiple R-squared:  0.8512,	Adjusted R-squared:  0.8481 
F-statistic:   274 on 8 and 383 DF,  p-value: < 2.2e-16



### Interaction Term: horsepower:origin 

Call:
lm(formula = formula, data = Auto)

Residuals:
   Min     1Q Median     3Q    Max 
-9.277 -1.875 -0.225  1.570 12.080 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -2.196e+01  4.396e+00  -4.996 8.94e-07 ***
cylinders         -5.275e-01  3.028e-01  -1.742   0.0823 .  
displacement      -1.486e-03  7.607e-03  -0.195   0.8452    
horsepower         8.173e-02  1.856e-02   4.404 1.38e-05 ***
weight            -4.710e-03  6.555e-04  -7.186 3.52e-12 ***
acceleration      -1.124e-01  9.617e-02  -1.168   0.2434    
year               7.327e-01  4.780e-02  15.328  < 2e-16 ***
origin             7.695e+00  8.858e-01   8.687  < 2e-16 ***
horsepower:origin -7.955e-02  1.074e-02  -7.405 8.44e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.116 on 383 degrees of freedom
Multiple R-squared:  0.8438,	Adjusted R-squared:  0.8406 
F-statistic: 258.7 on 8 and 383 DF,  p-value: < 2.2e-16



### Interaction Term: weight:acceleration 

Call:
lm(formula = formula, data = Auto)

Residuals:
   Min     1Q Median     3Q    Max 
-8.247 -2.048 -0.045  1.619 12.193 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)         -4.364e+01  5.811e+00  -7.511 4.18e-13 ***
cylinders           -2.141e-01  3.078e-01  -0.696 0.487117    
displacement         3.138e-03  7.495e-03   0.419 0.675622    
horsepower          -4.141e-02  1.348e-02  -3.071 0.002287 ** 
weight               4.027e-03  1.636e-03   2.462 0.014268 *  
acceleration         1.629e+00  2.422e-01   6.726 6.36e-11 ***
year                 7.821e-01  4.833e-02  16.184  < 2e-16 ***
origin               1.033e+00  2.686e-01   3.846 0.000141 ***
weight:acceleration -5.826e-04  8.408e-05  -6.928 1.81e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.141 on 383 degrees of freedom
Multiple R-squared:  0.8414,	Adjusted R-squared:  0.838 
F-statistic: 253.9 on 8 and 383 DF,  p-value: < 2.2e-16



### Interaction Term: weight:year 

Call:
lm(formula = formula, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.9995 -1.8495 -0.1559  1.6061 11.7042 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.186e+02  1.338e+01  -8.864  < 2e-16 ***
cylinders    -1.218e-01  3.032e-01  -0.402   0.6881    
displacement  1.293e-02  7.019e-03   1.842   0.0663 .  
horsepower   -2.877e-02  1.286e-02  -2.236   0.0259 *  
weight        3.044e-02  4.652e-03   6.543 1.94e-10 ***
acceleration  1.447e-01  9.196e-02   1.574   0.1164    
year          2.084e+00  1.732e-01  12.033  < 2e-16 ***
origin        1.174e+00  2.597e-01   4.519 8.30e-06 ***
weight:year  -4.879e-04  6.097e-05  -8.002 1.47e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.084 on 383 degrees of freedom
Multiple R-squared:  0.847,	Adjusted R-squared:  0.8439 
F-statistic: 265.1 on 8 and 383 DF,  p-value: < 2.2e-16



### Interaction Term: weight:origin 

Call:
lm(formula = formula, data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.6056  -2.0253  -0.1315   1.6295  12.5217 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -2.482e+01  4.891e+00  -5.075 6.06e-07 ***
cylinders     -5.799e-01  3.171e-01  -1.829   0.0682 .  
displacement   8.833e-03  7.810e-03   1.131   0.2588    
horsepower    -1.068e-02  1.358e-02  -0.787   0.4321    
weight        -2.782e-03  1.084e-03  -2.566   0.0107 *  
acceleration   3.766e-02  9.729e-02   0.387   0.6989    
year           7.632e-01  4.998e-02  15.270  < 2e-16 ***
origin         6.633e+00  1.265e+00   5.244 2.60e-07 ***
weight:origin -2.309e-03  5.477e-04  -4.215 3.11e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.257 on 383 degrees of freedom
Multiple R-squared:  0.8294,	Adjusted R-squared:  0.8258 
F-statistic: 232.7 on 8 and 383 DF,  p-value: < 2.2e-16



### Interaction Term: acceleration:year 

Call:
lm(formula = formula, data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.2277  -2.1428  -0.0839   1.8216  12.2997 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)       97.6199866 20.8291407   4.687 3.86e-06 ***
cylinders         -0.2163880  0.3148719  -0.687  0.49236    
displacement       0.0087776  0.0074936   1.171  0.24219    
horsepower        -0.0269212  0.0133812  -2.012  0.04493 *  
weight            -0.0058738  0.0006363  -9.232  < 2e-16 ***
acceleration      -7.1818103  1.2900806  -5.567 4.88e-08 ***
year              -0.7461986  0.2696927  -2.767  0.00593 ** 
origin             1.2630333  0.2691459   4.693 3.76e-06 ***
acceleration:year  0.0945502  0.0167501   5.645 3.22e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.202 on 383 degrees of freedom
Multiple R-squared:  0.8352,	Adjusted R-squared:  0.8317 
F-statistic: 242.6 on 8 and 383 DF,  p-value: < 2.2e-16



### Interaction Term: acceleration:origin 

Call:
lm(formula = formula, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.4106 -1.8805 -0.2471  1.7891 11.9680 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)         -0.3327273  5.0570077  -0.066   0.9476    
cylinders           -0.5881258  0.3063127  -1.920   0.0556 .  
displacement         0.0086251  0.0073062   1.181   0.2385    
horsepower          -0.0250843  0.0131049  -1.914   0.0564 .  
weight              -0.0052351  0.0006439  -8.131 5.98e-15 ***
acceleration        -1.0340600  0.1896960  -5.451 8.98e-08 ***
year                 0.7623813  0.0482774  15.792  < 2e-16 ***
origin              -9.3089774  1.6109675  -5.779 1.56e-08 ***
acceleration:origin  0.6546959  0.0969263   6.755 5.34e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.15 on 383 degrees of freedom
Multiple R-squared:  0.8405,	Adjusted R-squared:  0.8371 
F-statistic: 252.2 on 8 and 383 DF,  p-value: < 2.2e-16



### Interaction Term: year:origin 

Call:
lm(formula = formula, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.6072 -2.0439 -0.0596  1.7121 12.3368 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   8.492e+00  9.044e+00   0.939 0.348353    
cylinders    -5.042e-01  3.192e-01  -1.579 0.115082    
displacement  1.567e-02  7.530e-03   2.081 0.038060 *  
horsepower   -1.399e-02  1.364e-02  -1.025 0.305786    
weight       -6.352e-03  6.449e-04  -9.851  < 2e-16 ***
acceleration  9.185e-02  9.766e-02   0.941 0.347546    
year          4.189e-01  1.125e-01   3.723 0.000226 ***
origin       -1.405e+01  4.699e+00  -2.989 0.002978 ** 
year:origin   1.989e-01  6.030e-02   3.298 0.001064 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.286 on 383 degrees of freedom
Multiple R-squared:  0.8264,	Adjusted R-squared:  0.8228 
F-statistic: 227.9 on 8 and 383 DF,  p-value: < 2.2e-16

In [ ]:
# 下面这种方法弃用!!!

# 定义变量
variables <- c('cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'year', 'origin')

# 生成所有两两变量组合的交叉项
combinations <- combn(variables, 2, simplify = FALSE)

# 创建包含所有交叉项的公式
interaction_terms <- sapply(combinations, function(combo) paste(combo, collapse = ":"))
formula <- as.formula(paste("mpg ~ . -name +", paste(interaction_terms, collapse = " + ")))

# 拟合线性模型并输出摘要
lm.fit <- lm(formula, data = Auto)
summary(lm.fit)
Call:
lm(formula = formula, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.6303 -1.4481  0.0596  1.2739 11.1386 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)   
(Intercept)                3.548e+01  5.314e+01   0.668  0.50475   
cylinders                  6.989e+00  8.248e+00   0.847  0.39738   
displacement              -4.785e-01  1.894e-01  -2.527  0.01192 * 
horsepower                 5.034e-01  3.470e-01   1.451  0.14769   
weight                     4.133e-03  1.759e-02   0.235  0.81442   
acceleration              -5.859e+00  2.174e+00  -2.696  0.00735 **
year                       6.974e-01  6.097e-01   1.144  0.25340   
origin                    -2.090e+01  7.097e+00  -2.944  0.00345 **
cylinders:displacement    -3.383e-03  6.455e-03  -0.524  0.60051   
cylinders:horsepower       1.161e-02  2.420e-02   0.480  0.63157   
cylinders:weight           3.575e-04  8.955e-04   0.399  0.69000   
cylinders:acceleration     2.779e-01  1.664e-01   1.670  0.09584 . 
cylinders:year            -1.741e-01  9.714e-02  -1.793  0.07389 . 
cylinders:origin           4.022e-01  4.926e-01   0.816  0.41482   
displacement:horsepower   -8.491e-05  2.885e-04  -0.294  0.76867   
displacement:weight        2.472e-05  1.470e-05   1.682  0.09342 . 
displacement:acceleration -3.479e-03  3.342e-03  -1.041  0.29853   
displacement:year          5.934e-03  2.391e-03   2.482  0.01352 * 
displacement:origin        2.398e-02  1.947e-02   1.232  0.21875   
horsepower:weight         -1.968e-05  2.924e-05  -0.673  0.50124   
horsepower:acceleration   -7.213e-03  3.719e-03  -1.939  0.05325 . 
horsepower:year           -5.838e-03  3.938e-03  -1.482  0.13916   
horsepower:origin          2.233e-03  2.930e-02   0.076  0.93931   
weight:acceleration        2.346e-04  2.289e-04   1.025  0.30596   
weight:year               -2.245e-04  2.127e-04  -1.056  0.29182   
weight:origin             -5.789e-04  1.591e-03  -0.364  0.71623   
acceleration:year          5.562e-02  2.558e-02   2.174  0.03033 * 
acceleration:origin        4.583e-01  1.567e-01   2.926  0.00365 **
year:origin                1.393e-01  7.399e-02   1.882  0.06062 . 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.695 on 363 degrees of freedom
Multiple R-squared:  0.8893,	Adjusted R-squared:  0.8808 
F-statistic: 104.2 on 28 and 363 DF,  p-value: < 2.2e-16

image.png

In [43]:
# f

# 此处我们以horsepower为例,进行非线性变换拟合

par(mfrow = c(2, 2))
plot(Auto$horsepower, Auto$mpg)
plot(log(Auto$horsepower), Auto$mpg)
plot(sqrt(Auto$horsepower), Auto$mpg)
plot(Auto$horsepower^2, Auto$mpg)

par(mfrow = c(2, 2))
plot(Auto$horsepower, Auto$mpg, cex = 0.2)
plot(log(Auto$horsepower), Auto$mpg, cex = 0.2)
plot(sqrt(Auto$horsepower), Auto$mpg, cex = 0.2)
plot(Auto$horsepower^2, Auto$mpg, cex = 0.2)
No description has been provided for this image
No description has been provided for this image
In [ ]:
# 此处我们仅以log转换为例

Auto %>% mutate(horsepower=log(horsepower))
fit.new <- lm(mpg ~ . -name,data = Auto)
summary(fit.new)
par(mfrow = c(2, 2))
plot(fit.new)

Auto <- Auto %>% mutate(horsepower=log(horsepower))
fit.new <- lm(mpg ~ . -name,data = Auto)
summary(fit.new)
par(mfrow = c(2, 2))
plot(fit.new)
A data.frame: 392 × 9
mpgcylindersdisplacementhorsepowerweightaccelerationyearoriginname
<dbl><int><dbl><dbl><int><dbl><int><int><fct>
11883074.867534350412.0701chevrolet chevelle malibu
21583505.105945369311.5701buick skylark 320
31883185.010635343611.0701plymouth satellite
41683045.010635343312.0701amc rebel sst
51783024.941642344910.5701ford torino
61584295.288267434110.0701ford galaxie 500
71484545.3936284354 9.0701chevrolet impala
81484405.3706384312 8.5701plymouth fury iii
91484555.416100442510.0701pontiac catalina
101583905.2470243850 8.5701amc ambassador dpl
111583835.135798356310.0701dodge challenger se
121483405.0751743609 8.0701plymouth 'cuda 340
131584005.0106353761 9.5701chevrolet monte carlo
141484555.416100308610.0701buick estate wagon (sw)
152441134.553877237215.0703toyota corona mark ii
162261984.553877283315.5701plymouth duster
171861994.574711277415.5701amc hornet
182162004.442651258716.0701ford maverick
19274 974.477337213014.5703datsun pl510
20264 973.828641183520.5702volkswagen 1131 deluxe sedan
212541104.465908267217.5702peugeot 504
222441074.499810243014.5702audi 100 ls
232541044.553877237517.5702saab 99e
242641214.727388223412.5702bmw 2002
252161994.499810264815.0701amc gremlin
261083605.370638461514.0701ford f250
271083075.298317437615.0701chevy c20
281183185.347108438213.5701dodge d200
29 983045.262690473218.5701hi 1200d
30274 974.477337213014.5713datsun pl510
⋮⋮⋮⋮⋮⋮⋮⋮⋮⋮
3682841124.477337260519.6821chevrolet cavalier
3692741124.477337264018.6821chevrolet cavalier wagon
3703441124.477337239518.0821chevrolet cavalier 2-door
3713141124.442651257516.2821pontiac j2000 se hatchback
3722941354.430817252516.0821dodge aries se
3732741514.499810273518.0821pontiac phoenix
3742441404.521789286516.4821ford fairmont futura
3753641054.304065198015.3822volkswagen rabbit l
376374 914.219508202518.2823mazda glc custom l
377314 914.219508197017.6823mazda glc custom
3783841054.143135212514.7821plymouth horizon miser
379364 984.248495212517.3821mercury lynx l
3803641204.477337216014.5823nissan stanza xe
3813641074.317488220514.5823honda accord
3823441084.248495224516.9823toyota corolla
383384 914.204693196515.0823honda civic
384324 914.204693196515.7823honda civic (auto)
385384 914.204693199516.2823datsun 310 gx
3862561814.700480294516.4821buick century limited
3873862624.442651301517.0821oldsmobile cutlass ciera (diesel)
3882641564.521789258514.5821chrysler lebaron medallion
3892262324.718499283514.7821ford granada l
3903241444.564348266513.9823toyota celica gt
3913641354.430817237013.0821dodge charger 2.2
3922741514.499810295017.3821chevrolet camaro
3932741404.454347279015.6821ford mustang gl
394444 973.951244213024.6822vw pickup
3953241354.430817229511.6821dodge rampage
3962841204.369448262518.6821ford ranger
3973141194.406719272019.4821chevy s-10
Call:
lm(formula = mpg ~ . - name, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.5903 -2.1565 -0.1169  1.8690 13.0604 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
cylinders     -0.493376   0.323282  -1.526  0.12780    
displacement   0.019896   0.007515   2.647  0.00844 ** 
horsepower    -0.016951   0.013787  -1.230  0.21963    
weight        -0.006474   0.000652  -9.929  < 2e-16 ***
acceleration   0.080576   0.098845   0.815  0.41548    
year           0.750773   0.050973  14.729  < 2e-16 ***
origin         1.426141   0.278136   5.127 4.67e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.328 on 384 degrees of freedom
Multiple R-squared:  0.8215,	Adjusted R-squared:  0.8182 
F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16
Call:
lm(formula = mpg ~ . - name, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.3115 -2.0041 -0.1726  1.8393 12.6579 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  27.254005   8.589614   3.173  0.00163 ** 
cylinders    -0.486206   0.306692  -1.585  0.11372    
displacement  0.019456   0.006876   2.830  0.00491 ** 
horsepower   -9.506436   1.539619  -6.175 1.69e-09 ***
weight       -0.004266   0.000694  -6.148 1.97e-09 ***
acceleration -0.292088   0.103804  -2.814  0.00515 ** 
year          0.705329   0.048456  14.556  < 2e-16 ***
origin        1.482435   0.259347   5.716 2.19e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.18 on 384 degrees of freedom
Multiple R-squared:  0.837,	Adjusted R-squared:  0.834 
F-statistic: 281.6 on 7 and 384 DF,  p-value: < 2.2e-16
No description has been provided for this image
No description has been provided for this image
In [ ]:
# 可以发现在log转变之后horsepower似乎与mpg更呈现线性关系了,并且p值更加显著了

13,

image.png

In [47]:
set.seed(1)
In [48]:
# a
X <- rnorm(n = 100,mean = 0,sd = 1)
print(X)
  [1] -0.626453811  0.183643324 -0.835628612  1.595280802  0.329507772
  [6] -0.820468384  0.487429052  0.738324705  0.575781352 -0.305388387
 [11]  1.511781168  0.389843236 -0.621240581 -2.214699887  1.124930918
 [16] -0.044933609 -0.016190263  0.943836211  0.821221195  0.593901321
 [21]  0.918977372  0.782136301  0.074564983 -1.989351696  0.619825748
 [26] -0.056128740 -0.155795507 -1.470752384 -0.478150055  0.417941560
 [31]  1.358679552 -0.102787727  0.387671612 -0.053805041 -1.377059557
 [36] -0.414994563 -0.394289954 -0.059313397  1.100025372  0.763175748
 [41] -0.164523596 -0.253361680  0.696963375  0.556663199 -0.688755695
 [46] -0.707495157  0.364581962  0.768532925 -0.112346212  0.881107726
 [51]  0.398105880 -0.612026393  0.341119691 -1.129363096  1.433023702
 [56]  1.980399899 -0.367221476 -1.044134626  0.569719627 -0.135054604
 [61]  2.401617761 -0.039240003  0.689739362  0.028002159 -0.743273209
 [66]  0.188792300 -1.804958629  1.465554862  0.153253338  2.172611670
 [71]  0.475509529 -0.709946431  0.610726353 -0.934097632 -1.253633400
 [76]  0.291446236 -0.443291873  0.001105352  0.074341324 -0.589520946
 [81] -0.568668733 -0.135178615  1.178086997 -1.523566800  0.593946188
 [86]  0.332950371  1.063099837 -0.304183924  0.370018810  0.267098791
 [91] -0.542520031  1.207867806  1.160402616  0.700213650  1.586833455
 [96]  0.558486426 -1.276592208 -0.573265414 -1.224612615 -0.473400636
In [50]:
# b
eps <- rnorm(n = 100,mean = 0,sd = sqrt(0.25)) # 注意这里是sd = sqrt(0.25)而不是0.25
print(eps)
  [1]  0.204700920  0.844436643  0.793294217 -0.165453900 -1.142617768
  [6]  1.248830795  0.333533083  0.270663668 -0.006699762  0.255054211
 [11] -0.082187916  0.210347322 -0.200123372 -0.685103939  0.493919134
 [16]  0.759872513 -0.154370285 -0.626644878  0.321120653 -0.022354568
 [21] -0.866609203  0.001065930 -0.315150167 -0.170484290 -0.578286181
 [26]  0.901570954 -0.165566018 -0.802756706  0.098596719  0.131587823
 [31] -0.492913350 -1.444460336 -0.320240851  0.285253818 -0.029861638
 [36] -0.049089372  0.280410364 -0.593229319  0.548388522 -0.002672014
 [41]  0.353655334  0.517053867  0.111740207 -0.439353806  0.581482278
 [46] -1.000082472 -0.272395370 -0.127835355 -0.083060518  0.510231954
 [51]  0.068110947  0.203583802 -0.034827407 -0.123832171  0.347775403
 [56]  0.573114179 -1.201548107  0.286369778  0.187362203 -0.212633861
 [61]  0.475506404 -0.194618591 -0.142165331  0.428704889  0.859813650
 [66]  0.135027450 -0.211092005 -0.594556647 -0.165516489 -0.469914663
 [71] -0.129466292  0.197189584 -0.425928546  1.324583441  0.078005838
 [76]  0.565103634 -1.144561990  0.370500579 -0.658122580  0.459901839
 [81]  0.199065078 -0.203764290  0.662129315 -0.350615835 -0.290307152
 [86] -0.500536091 -0.334089303  0.472592477  0.216851075  0.502579609
 [91] -0.195059332  0.188185146  0.122082462 -0.713128671  0.889214644
 [96]  0.067223830  0.382799500  0.477568338 -0.025282851 -0.152907710
In [ ]:
# c
Y <- -1 + 0.5 * X + eps
print(Y)
length(Y)  # 向量长100
  [1] -1.10852599 -0.06374169 -0.62452009 -0.36781350 -1.97786388 -0.16140340
  [7] -0.42275239 -0.36017398 -0.71880909 -0.89763998 -0.32629733 -0.59473106
 [13] -1.51074366 -2.79245388  0.05638459 -0.26259429 -1.16246542 -1.15472677
 [19] -0.26826875 -0.72540391 -1.40712052 -0.60786592 -1.27786768 -2.16516014
 [25] -1.26837331 -0.12649342 -1.24346377 -2.53813290 -1.14047831 -0.65944140
 [31] -0.81357357 -2.49585420 -1.12640505 -0.74164870 -1.71839142 -1.25658665
 [37] -0.91673461 -1.62288602  0.09840121 -0.62108414 -0.72860646 -0.60962697
 [43] -0.53977810 -1.16102221 -0.76289557 -2.35383005 -1.09010439 -0.74356889
 [49] -1.13923362 -0.04921418 -0.73283611 -1.10242939 -0.86426756 -1.68851372
 [55]  0.06428725  0.56331413 -2.38515885 -1.23569754 -0.52777798 -1.28016116
 [61]  0.67631528 -1.21423859 -0.79729565 -0.55729403 -0.51182295 -0.77057640
 [67] -2.11357132 -0.86177922 -1.08888982 -0.38360883 -0.89171153 -1.15778363
 [73] -1.12056537 -0.14246538 -1.54881086 -0.28917325 -2.36620793 -0.62894675
 [79] -1.62095192 -0.83485863 -1.08526929 -1.27135360  0.25117281 -2.11239923
 [85] -0.99333406 -1.33406090 -0.80253938 -0.67949949 -0.59813952 -0.36387100
 [91] -1.46631935 -0.20788095 -0.29771623 -1.36302185  0.68263137 -0.65353296
 [97] -1.25549660 -0.80906437 -1.63758916 -1.38960803
100
In [ ]:
# 我们开始拟合模型
lm.fit <- lm(Y ~ X)
summary(lm.fit)

# (Intercept) -0.98632,对应的就是截距的估计值,X            0.51058,对应的就是斜率的估计值
Call:
lm(formula = Y ~ X)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.45706 -0.24115 -0.02266  0.32462  1.32079 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.98632    0.05235 -18.840  < 2e-16 ***
X            0.51058    0.05815   8.781 5.34e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5197 on 98 degrees of freedom
Multiple R-squared:  0.4403,	Adjusted R-squared:  0.4346 
F-statistic:  77.1 on 1 and 98 DF,  p-value: 5.336e-14
In [ ]:
# d

plot(X,Y)

# 从图上看起来,似乎是线性的
No description has been provided for this image
In [59]:
# e

# 我们开始拟合模型
lm.fit <- lm(Y ~ X)
summary(lm.fit)

# 看sd的话,两个系数拟合的挺好的
# 模型的系数估计值都非常显著,标准误较小,说明模型拟合良好。
# 截距项和 X 的系数都显著不为零,说明 X 对 y 有显著的线性影响。
Call:
lm(formula = Y ~ X)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.45706 -0.24115 -0.02266  0.32462  1.32079 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.98632    0.05235 -18.840  < 2e-16 ***
X            0.51058    0.05815   8.781 5.34e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5197 on 98 degrees of freedom
Multiple R-squared:  0.4403,	Adjusted R-squared:  0.4346 
F-statistic:  77.1 on 1 and 98 DF,  p-value: 5.336e-14

image.png

In [66]:
# f
plot(X, Y)
abline(lm.fit,col="red")

abline(-0.98632, 0.51058, col = "red", lty = 2)

legend("topleft",
  c("model fit", "population regression"),
  col = c("black", "red"),
  lty = c(1, 2)
)
No description has been provided for this image
In [ ]:
# g

lm.fit <- lm(Y ~ X)
summary(lm.fit)

fit.2 <- lm(Y ~ X + I(X^2))
summary(fit.2)

# 或者
fit.22 <- lm(Y ~ poly(X, 2,raw = TRUE))
summary(fit.22)

anova(fit.2,lm.fit)
anova(fit.22,lm.fit)

# F检验都不显著,说明模型的拟合效果并没有提升
Call:
lm(formula = Y ~ X)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.45706 -0.24115 -0.02266  0.32462  1.32079 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.98632    0.05235 -18.840  < 2e-16 ***
X            0.51058    0.05815   8.781 5.34e-14 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5197 on 98 degrees of freedom
Multiple R-squared:  0.4403,	Adjusted R-squared:  0.4346 
F-statistic:  77.1 on 1 and 98 DF,  p-value: 5.336e-14
Call:
lm(formula = Y ~ X + I(X^2))

Residuals:
     Min       1Q   Median       3Q      Max 
-1.46371 -0.24761 -0.01792  0.32978  1.32271 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.979434   0.064138 -15.271  < 2e-16 ***
X            0.511912   0.058865   8.696 8.71e-14 ***
I(X^2)      -0.008668   0.046208  -0.188    0.852    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5223 on 97 degrees of freedom
Multiple R-squared:  0.4405,	Adjusted R-squared:  0.429 
F-statistic: 38.19 on 2 and 97 DF,  p-value: 5.856e-13
Call:
lm(formula = Y ~ poly(X, 2, raw = TRUE))

Residuals:
     Min       1Q   Median       3Q      Max 
-1.46371 -0.24761 -0.01792  0.32978  1.32271 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)             -0.979434   0.064138 -15.271  < 2e-16 ***
poly(X, 2, raw = TRUE)1  0.511912   0.058865   8.696 8.71e-14 ***
poly(X, 2, raw = TRUE)2 -0.008668   0.046208  -0.188    0.852    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5223 on 97 degrees of freedom
Multiple R-squared:  0.4405,	Adjusted R-squared:  0.429 
F-statistic: 38.19 on 2 and 97 DF,  p-value: 5.856e-13
A anova: 2 × 6
Res.DfRSSDfSum of SqFPr(>F)
<dbl><dbl><dbl><dbl><dbl><dbl>
19726.45666NA NA NA NA
29826.46626-1-0.0095982430.035190750.8515884
A anova: 2 × 6
Res.DfRSSDfSum of SqFPr(>F)
<dbl><dbl><dbl><dbl><dbl><dbl>
19726.45666NA NA NA NA
29826.46626-1-0.0095982430.035190750.8515884
In [ ]:
# h

# 比如说我们拟合方差为0.05的误差的模型

x <- rnorm(100, 0, 1)
y <- -1 + 0.5 * x + rnorm(100, 0, sqrt(0.05))
fit2 <- lm(y ~ x)
summary(fit2)
plot(x, y)
abline(fit2)
abline(-1, 0.5, col = "red", lty = 2)
legend("topleft",
  c("model fit", "population regression"),
  col = c("black", "red"),
  lty = c(1, 2)
)

# 可以看出,数据变异性减小,R2变高了
Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.6547 -0.1600 -0.0127  0.1452  0.8618 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.00908    0.02631  -38.35   <2e-16 ***
x            0.50630    0.02663   19.01   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2628 on 98 degrees of freedom
Multiple R-squared:  0.7867,	Adjusted R-squared:  0.7845 
F-statistic: 361.4 on 1 and 98 DF,  p-value: < 2.2e-16
No description has been provided for this image
In [ ]:
# i

# 噪声越高,我们就拟合1

x <- rnorm(100, 0, 1)
y <- -1 + 0.5 * x + rnorm(100, 0, 1)
fit3 <- lm(y ~ x)
summary(fit3)
plot(x, y)
abline(fit3)
abline(-1, 0.5, col = "red", lty = 2)
legend("topleft",
  c("model fit", "population regression"),
  col = c("black", "red"),
  lty = c(1, 2)
)

# 可以看出,数据变异性增大,R2变低了
Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.73419 -0.80105 -0.05787  0.74623  2.87841 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.1992     0.1087 -11.027  < 2e-16 ***
x             0.5086     0.1130   4.501 1.87e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.086 on 98 degrees of freedom
Multiple R-squared:  0.1713,	Adjusted R-squared:  0.1628 
F-statistic: 20.26 on 1 and 98 DF,  p-value: 1.866e-05
No description has been provided for this image
In [ ]:
# j
confint(lm.fit)
confint(fit2)
confint(fit3)

# 误差越小,置信区间越小
A matrix: 2 × 2 of type dbl
2.5 %97.5 %
(Intercept)-1.0902064-0.8824249
X 0.3951885 0.6259784
A matrix: 2 × 2 of type dbl
2.5 %97.5 %
(Intercept)-1.0612967-0.9568549
x 0.4534481 0.5591528
A matrix: 2 × 2 of type dbl
2.5 %97.5 %
(Intercept)-1.4150018-0.9833927
x 0.2843477 0.7328345